abp/filters/parser.py - Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list

Delta Between Two Patch Sets: abp/filters/parser.py

Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)

Left Patch Set: Correct behavior, add comments, improve naming, add tests Created Sept. 18, 2018, 12:37 p.m.

Right Patch Set: Fix header parsing, improve argument naming and documentation Created Sept. 18, 2018, 6:06 p.m.

Left:
Right:

Use n/p to move between diff chunks; N/P to move between comments.

Jump to:

Left: Side by side diff | Download
Right: Side by side diff | Download

LEFT	RIGHT
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 124 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
135 Header = _line_type('Header', 'version', '[{.version}]')	135 Header = _line_type('Header', 'version', '[{.version}]')

136 EmptyLine = _line_type('EmptyLine', '', '')	136 EmptyLine = _line_type('EmptyLine', '', '')

137 Comment = _line_type('Comment', 'text', '! {.text}')	137 Comment = _line_type('Comment', 'text', '! {.text}')

138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')	138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')

139 Filter = _line_type('Filter', 'text selector action options', '{.text}')	139 Filter = _line_type('Filter', 'text selector action options', '{.text}')

140 Include = _line_type('Include', 'target', '%include {0.target}%')	140 Include = _line_type('Include', 'target', '%include {0.target}%')

141	141

142	142

143 METADATA_REGEXP = re.compile(r'\s!\s(.?)\s:\s(.)')	143 METADATA_REGEXP = re.compile(r'\s!\s(.?)\s:\s(.)')

144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')	144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')

145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\sPlus\s[\d\.]+?)?)\]', flags=re.I)	145 HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\sPlus\s[\d\.]+?)?)\|.*)\]',

	146 flags=re.I)

146 HIDING_FILTER_REGEXP = re.compile(r'^([^/\|@"!]?)#([@?])?#(.+)$')	147 HIDING_FILTER_REGEXP = re.compile(r'^([^/\|@"!]?)#([@?])?#(.+)$')

147 FILTER_OPTIONS_REGEXP = re.compile(	148 FILTER_OPTIONS_REGEXP = re.compile(

148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'	149 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'

149 )	150 )

150

151

152 def _parse_header(text):

153 match = HEADER_REGEXP.match(text)

154 if not match:

155 raise ParseError('Malformed header', text)

156 return Header(match.group(1))

157	151

158	152

159 def _parse_instruction(text):	153 def _parse_instruction(text):

160 match = INCLUDE_REGEXP.match(text)	154 match = INCLUDE_REGEXP.match(text)

161 if not match:	155 if not match:

162 raise ParseError('Unrecognized instruction', text)	156 raise ParseError('Unrecognized instruction', text)

163 return Include(match.group(1))	157 return Include(match.group(1))

164	158

165	159

166 def _parse_option(option):	160 def _parse_option(option):

(...skipping 77 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
244 Parsed filter.	238 Parsed filter.

245	239

246 """	240 """

247 if '#' in text:	241 if '#' in text:

248 match = HIDING_FILTER_REGEXP.search(text)	242 match = HIDING_FILTER_REGEXP.search(text)

249 if match:	243 if match:

250 return _parse_hiding_filter(text, *match.groups())	244 return _parse_hiding_filter(text, *match.groups())

251 return _parse_blocking_filter(text)	245 return _parse_blocking_filter(text)

252	246

253	247

254 def parse_line(line_text, mode='body'):	248 def parse_line(line, position='body'):
Sebastian Noack 2018/09/18 15:19:08 We probably should call the argument here "positio We probably should call the argument here "position" too. Vasily Kuznetsov 2018/09/18 18:11:44 I originally left it as "mode" on purpose, but now Show quoted text On 2018/09/18 15:19:08, Sebastian Noack wrote: > We probably should call the argument here "position" too. I originally left it as "mode" on purpose, but now it seems that "position" is also a good name and it's more consistent with the use in parse_filterlist(). Done.
255 """Parse one line of a filter list.	249 """Parse one line of a filter list.

256	250

257 The types of lines that that the parser recognizes depend on the mode. In	251 The types of lines that that the parser recognizes depend on the position.

258 body mode the parser only recognizes filters, comments, processing	252 If position="body", the parser only recognizes filters, comments,

259 instructions and empty lines. In medata mode it in addition recognizes	253 processing instructions and empty lines. If position="metadata", it in

260 metadata. In start mode it also recognizes headers.	254 addition recognizes metadata. If position="start", it also recognizes

261	255 headers.

262 Note: checksum metadata lines are recognized in all modes for backwards	256
Sebastian Noack 2018/09/18 15:19:07 Typo: Capitalize the first word after a colon. Typo: Capitalize the first word after a colon. Vasily Kuznetsov 2018/09/18 18:11:44 Done. Show quoted text On 2018/09/18 15:19:07, Sebastian Noack wrote: > Typo: Capitalize the first word after a colon. Done.
	257 Note: Checksum metadata lines are recognized in all positions for backwards

263 compatibility. Historically, checksums can occur at the bottom of the	258 compatibility. Historically, checksums can occur at the bottom of the

264 filter list. They are are no longer used by Adblock Plus, but in order to	259 filter list. They are are no longer used by Adblock Plus, but in order to

265 strip them (in abp.filters.renderer), we have to make sure to still parse	260 strip them (in abp.filters.renderer), we have to make sure to still parse

266 them regardless of their position in the filter list.	261 them regardless of their position in the filter list.

267	262

268 Parameters	263 Parameters

269 ----------	264 ----------

270 line_text : str	265 line : str

271 Line of a filter list.	266 Line of a filter list.

272 mode : str	267 position : str

273 Parsing mode, one of "start", "metadata" or "body" (default).	268 Position in the filter list, one of "start", "metadata" or "body"

	269 (default is "body").

274	270

275 Returns	271 Returns

276 -------	272 -------

277 namedtuple	273 namedtuple

278 Parsed line (see `_line_type`).	274 Parsed line (see `_line_type`).

279	275

280 Raises	276 Raises

281 ------	277 ------

282 ParseError	278 ParseError

283 ParseError: If the line can't be parsed.	279 ParseError: If the line can't be parsed.

284	280

285 """	281 """

286 MODES = {'body', 'start', 'metadata'}	282 POSITIONS = {'body', 'start', 'metadata'}

287 if mode not in MODES:	283 if position not in POSITIONS:

288 raise ValueError('mode should be one of {}'.format(MODES))	284 raise ValueError('position should be one of {}'.format(POSITIONS))

289	285

290 if isinstance(line_text, type(b'')):	286 if isinstance(line, type(b'')):

291 line_text = line_text.decode('utf-8')	287 line = line.decode('utf-8')

292	288

293 content = line_text.strip()	289 stripped = line.strip()

294	290

295 if content == '':	291 if stripped == '':

296 return EmptyLine()	292 return EmptyLine()

297	293

298 if content.startswith('!'):	294 if position == 'start':

299 match = METADATA_REGEXP.match(line_text)	295 match = HEADER_REGEXP.search(line)

	296 if match:

	297 version = match.group(1)

	298 if not version:

	299 raise ParseError('Malformed header', line)

	300 return Header(version)

	301

	302 if stripped.startswith('!'):

	303 match = METADATA_REGEXP.match(line)

300 if match:	304 if match:

301 key, value = match.groups()	305 key, value = match.groups()

302 # Metadata is only parsed in start or metadata modes but there's	306 if position != 'body' or key.lower() == 'checksum':

303 # an exception for checksums (see docstring).
Sebastian Noack 2018/09/18 15:19:08 Nit: IMO, this comment is redundant. That we speci Nit: IMO, this comment is redundant. That we special case "checksum" is obvious from the code, the reason why is no covered in the docstring. Vasily Kuznetsov 2018/09/18 18:11:43 Done. Show quoted text On 2018/09/18 15:19:08, Sebastian Noack wrote: > Nit: IMO, this comment is redundant. That we special case "checksum" is obvious > from the code, the reason why is no covered in the docstring. Done.
304 if mode != 'body' or key.lower() == 'checksum':

305 return Metadata(key, value)	307 return Metadata(key, value)

306 return Comment(content[1:].lstrip())	308 return Comment(stripped[1:].lstrip())

307	309

308 if content.startswith('%') and content.endswith('%'):	310 if stripped.startswith('%') and stripped.endswith('%'):

309 return _parse_instruction(content)	311 return _parse_instruction(stripped)

310	312

311 if mode == 'start' and (line_text.startswith('[') and	313 return parse_filter(stripped)

312 line_text.endswith(']')):
Sebastian Noack 2018/09/18 15:19:08 Nit: Perhaps just rename the variable (e.g. to "li Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need to wrap here. Vasily Kuznetsov 2018/09/18 18:11:43 Done. Show quoted text On 2018/09/18 15:19:08, Sebastian Noack wrote: > Nit: Perhaps just rename the variable (e.g. to "line"), so that you don't need > to wrap here. Done.
313 return _parse_header(content)
Sebastian Noack 2018/09/18 15:19:08 The result should be the same, since by asserting The result should be the same, since by asserting that the lines starts end ends with square brackets, it's implied that there are no leading or trailing whitespaces, but it might be easier to verify the code without having to make that assumption by passing in line_text instead of content here. Vasily Kuznetsov 2018/09/18 18:11:43 Done. Show quoted text On 2018/09/18 15:19:08, Sebastian Noack wrote: > The result should be the same, since by asserting that the lines starts end ends > with square brackets, it's implied that there are no leading or trailing > whitespaces, but it might be easier to verify the code without having to make > that assumption by passing in line_text instead of content here. Done.
314

315 return parse_filter(content)

316	314

317	315

318 def parse_filterlist(lines):	316 def parse_filterlist(lines):

319 """Parse filter list from an iterable.	317 """Parse filter list from an iterable.

320	318

321 Parameters	319 Parameters

322 ----------	320 ----------

323 lines: iterable of str	321 lines: iterable of str

324 Lines of the filter list.	322 Lines of the filter list.

325	323

(...skipping 15 matching lines...) Expand all Loading...
341 for line in lines:	339 for line in lines:

342 parsed_line = parse_line(line, position)	340 parsed_line = parse_line(line, position)

343 yield parsed_line	341 yield parsed_line

344	342

345 if position != 'body' and parsed_line.type in {'header', 'metadata'}:	343 if position != 'body' and parsed_line.type in {'header', 'metadata'}:

346 # Continue parsing metadata until it's over...	344 # Continue parsing metadata until it's over...

347 position = 'metadata'	345 position = 'metadata'

348 else:	346 else:

349 # ...then switch to parsing the body.	347 # ...then switch to parsing the body.

350 position = 'body'	348 position = 'body'

LEFT	RIGHT