Left: | ||
Right: |
OLD | NEW |
---|---|
1 # This file is part of the Adblock Plus web scripts, | 1 # This file is part of the Adblock Plus web scripts, |
2 # Copyright (C) 2006-present eyeo GmbH | 2 # Copyright (C) 2006-present eyeo GmbH |
3 # | 3 # |
4 # Adblock Plus is free software: you can redistribute it and/or modify | 4 # Adblock Plus is free software: you can redistribute it and/or modify |
5 # it under the terms of the GNU General Public License version 3 as | 5 # it under the terms of the GNU General Public License version 3 as |
6 # published by the Free Software Foundation. | 6 # published by the Free Software Foundation. |
7 # | 7 # |
8 # Adblock Plus is distributed in the hope that it will be useful, | 8 # Adblock Plus is distributed in the hope that it will be useful, |
9 # but WITHOUT ANY WARRANTY; without even the implied warranty of | 9 # but WITHOUT ANY WARRANTY; without even the implied warranty of |
10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | 10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
11 # GNU General Public License for more details. | 11 # GNU General Public License for more details. |
12 # | 12 # |
13 # You should have received a copy of the GNU General Public License | 13 # You should have received a copy of the GNU General Public License |
14 # along with Adblock Plus. If not, see <http://www.gnu.org/licenses/>. | 14 # along with Adblock Plus. If not, see <http://www.gnu.org/licenses/>. |
15 | 15 |
16 from __future__ import unicode_literals | 16 from __future__ import unicode_literals |
17 | 17 |
18 import os | 18 import os |
19 import HTMLParser | 19 import HTMLParser |
20 import re | 20 import re |
21 import urlparse | 21 import urlparse |
22 import json | |
22 | 23 |
23 import jinja2 | 24 import jinja2 |
24 import markdown | 25 import markdown |
25 | 26 |
26 | 27 |
# Monkey-patch Markdown's isBlockLevel function to ensure that no paragraphs
# are inserted into the <head> tag
# Keep a reference to the original so the patched version can delegate to it.
orig_isBlockLevel = markdown.util.isBlockLevel
30 | 31 |
31 | 32 |
def isBlockLevel(tag):
    # Treat <head> as block-level so Markdown won't wrap its contents in
    # paragraph tags; every other tag is classified by the original check.
    return True if tag == 'head' else orig_isBlockLevel(tag)
36 | 37 |
38 | |
Sebastian Noack
2017/08/29 22:49:39
Adding this blank line is unrelated.
rosie
2018/03/26 02:32:21
True. My linter was showing a warning because ther
Sebastian Noack
2018/03/26 02:57:47
Strictly, there should be two blank lines surround
| |
# Install the patched classifier into the markdown library.
markdown.util.isBlockLevel = isBlockLevel
38 | 40 |
# Characters that must be replaced by HTML entities when escaping text for
# HTML output.  (In this rendered copy of the diff the entity values had been
# HTML-decoded, which made the mapping a useless identity map; the entities
# are restored here.)
html_escapes = {
    '<': '&lt;',
    '>': '&gt;',
    '&': '&amp;',
    '"': '&quot;',
    "'": '&#39;',
}
46 | 48 |
(...skipping 63 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
110 # the document. | 112 # the document. |
111 self._append_text(data) | 113 self._append_text(data) |
112 | 114 |
113 def handle_entityref(self, name): | 115 def handle_entityref(self, name): |
114 self._append_text(self.unescape('&{};'.format(name))) | 116 self._append_text(self.unescape('&{};'.format(name))) |
115 | 117 |
116 def handle_charref(self, name): | 118 def handle_charref(self, name): |
117 self._append_text(self.unescape('&#{};'.format(name))) | 119 self._append_text(self.unescape('&#{};'.format(name))) |
118 | 120 |
119 | 121 |
def parse_metadata(page, data):
    """Extract metadata from the beginning of a page's source.

    The metadata is either a JSON object at the start of `data` or, in the
    legacy format, a run of leading ``name = value`` lines.  Returns a
    ``(metadata, page_data)`` tuple where `metadata` always contains at
    least the page name under the ``'page'`` key and `page_data` is the
    remaining page source.
    """
    metadata = {'page': page}
    try:
        decoder = json.JSONDecoder()
        json_data, index = decoder.raw_decode(data)
        # Valid JSON that isn't an object (e.g. a list or a number) is not
        # metadata; raise to fall through to the legacy parser instead of
        # crashing on metadata.update() with a non-dict.
        if not isinstance(json_data, dict):
            raise ValueError('metadata must be a JSON object')
        metadata.update(json_data)
        page_data = data[index:].strip()
    except ValueError:
        # Legacy format: consume leading "name = value" lines until the
        # first line that doesn't look like a metadata assignment.
        lines = data.splitlines(True)
        for i, line in enumerate(lines):
            if not re.search(r'^\s*[\w-]+\s*=', line):
                break
            name, value = line.split('=', 1)
            value = value.strip()
            # A "[a, b, c]" value becomes a list of stripped strings.
            if value.startswith('[') and value.endswith(']'):
                value = [element.strip() for element in value[1:-1].split(',')]
            lines[i] = ''
            metadata[name.strip()] = value
        page_data = ''.join(lines)
    return metadata, page_data
142 | |
143 | |
def parse_page_content(page, data):
    """Separate page content into metadata (dict) and body text (str)"""
    # If metadata is in a comment block, extract it
    comment_start = '<!--'
    comment_end = '-->'
    if not data.lstrip().startswith(comment_start):
        # No leading comment: the metadata sits directly at the top of the
        # page source.
        metadata, page_data = parse_metadata(page, data.strip())
        return metadata, page_data
    # Metadata lives inside the leading HTML comment.  A missing "-->"
    # deliberately raises ValueError here, same as before.
    meta_begin = data.index(comment_start) + len(comment_start)
    meta_end = data.index(comment_end)
    raw_comment = data[meta_begin:meta_end]
    remainder = data[meta_end + len(comment_end):]
    metadata, comment_data = parse_metadata(page, raw_comment.strip())
    # Re-wrap whatever the metadata parser left over in a fresh comment so
    # downstream processing sees the same layout as before.
    page_data = '{}\n{}\n{}\n\n{}'.format(comment_start, comment_data,
                                          comment_end, remainder.strip())
    return metadata, page_data
137 | 160 |
138 | 161 |
139 class Converter: | 162 class Converter: |
140 whitelist = {'a', 'em', 'sup', 'strong', 'code', 'span'} | 163 whitelist = {'a', 'em', 'sup', 'strong', 'code', 'span'} |
141 missing_translations = 0 | 164 missing_translations = 0 |
142 total_translations = 0 | 165 total_translations = 0 |
143 | 166 |
144 def __init__(self, params, key='pagedata'): | 167 def __init__(self, params, key='pagedata'): |
145 self._params = params | 168 self._params = params |
146 self._key = key | 169 self._key = key |
(...skipping 409 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
556 stack.pop() | 579 stack.pop() |
557 stack[-1]['subitems'].append(item) | 580 stack[-1]['subitems'].append(item) |
558 stack.append(item) | 581 stack.append(item) |
559 return structured | 582 return structured |
560 | 583 |
# Maps a page source file's extension to the converter class used to
# render it.
converters = {
    'html': RawConverter,
    'md': MarkdownConverter,
    'tmpl': TemplateConverter,
}
OLD | NEW |