abp/filters/parser.py - Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list

Side by Side Diff: abp/filters/parser.py

Issue 29880577: Issue 6877 - Only parse headers in the first line of the filter list (Closed)

Patch Set: Initial Created Sept. 14, 2018, 4:43 p.m.

OLD	NEW
1 # This file is part of Adblock Plus <https://adblockplus.org/>,	1 # This file is part of Adblock Plus <https://adblockplus.org/>,

2 # Copyright (C) 2006-present eyeo GmbH	2 # Copyright (C) 2006-present eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

(...skipping 122 matching lines...)
133	133

134	134

135 Header = _line_type('Header', 'version', '[{.version}]')	135 Header = _line_type('Header', 'version', '[{.version}]')

136 EmptyLine = _line_type('EmptyLine', '', '')	136 EmptyLine = _line_type('EmptyLine', '', '')

137 Comment = _line_type('Comment', 'text', '! {.text}')	137 Comment = _line_type('Comment', 'text', '! {.text}')

138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')	138 Metadata = _line_type('Metadata', 'key value', '! {0.key}: {0.value}')

139 Filter = _line_type('Filter', 'text selector action options', '{.text}')	139 Filter = _line_type('Filter', 'text selector action options', '{.text}')

140 Include = _line_type('Include', 'target', '%include {0.target}%')	140 Include = _line_type('Include', 'target', '%include {0.target}%')

141	141

142	142

143 METADATA_REGEXP = re.compile(r'(.*?)\s*:\s*(.*)')	143 METADATA_REGEXP = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')

144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')	144 INCLUDE_REGEXP = re.compile(r'%include\s+(.+)%')

145 HEADER_REGEXP = re.compile(r'\[(Adblock(?:\s*Plus\s*[\d\.]+?)?)\]', flags=re.I)	145 HEADER_REGEXP = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$',
	Sebastian Noack 2018/09/15 16:08:32
	I changed this regular expression like this in my patch so that I don't have to first check whether the line starts and ends with square brackets. With your implementation this seems redundant.

	Vasily Kuznetsov 2018/09/17 10:40:27
	Yeah, you're right. I think the logic of parse_line() is clearer the way it is, so I will undo the regexp change.

	Vasily Kuznetsov 2018/09/18 12:41:14
	Done.
	146 flags=re.I)

146 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')	147 HIDING_FILTER_REGEXP = re.compile(r'^([^/*|@"!]*?)#([@?])?#(.+)$')

147 FILTER_OPTIONS_REGEXP = re.compile(	148 FILTER_OPTIONS_REGEXP = re.compile(

148 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'	149 r'\$(~?[\w-]+(?:=[^,]+)?(?:,~?[\w-]+(?:=[^,]+)?)*)$'

149 )	150 )
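
A rough illustration of what anchoring the metadata pattern on the leading `!` changes. Both patterns below are assumptions: the quantifiers are written out as they most plausibly read (`.*?`, `\s*`), and the names OLD_METADATA and NEW_METADATA are hypothetical, not identifiers from abp.filters.parser.

```python
import re

# Assumed forms of the old and new metadata patterns (hypothetical names).
OLD_METADATA = re.compile(r'(.*?)\s*:\s*(.*)')
NEW_METADATA = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')

line = '! Title: Example list'

# The old pattern was applied to the comment text, after "!" was removed.
old = OLD_METADATA.match(line[1:].lstrip())

# The new pattern is applied to the raw line, so it has to consume "!" itself.
new = NEW_METADATA.match(line)

# A line without the "!" prefix is not treated as metadata by the new pattern.
no_prefix = NEW_METADATA.match('Title: Example list')

print(old.groups(), new.groups(), no_prefix)
```

Under these assumptions both patterns extract the same key and value; the difference is only in which string they are matched against.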

150	151

151	152

152 def _parse_header(text):	153 def _parse_header(text):

153 match = HEADER_REGEXP.match(text)	154 match = HEADER_REGEXP.match(text)

154 if not match:	155 version = match.group(1) if match else None

	156 if not version:

155 raise ParseError('Malformed header', text)	157 raise ParseError('Malformed header', text)

156 return Header(match.group(1))	158 return Header(version)
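
A minimal sketch of why the right-hand side checks `version` rather than `match`. It assumes the widened pattern reads as below (quantifiers written out as `\s*`); `HEADER` and `header_version` are hypothetical names used only for illustration. With the extra `|.*` alternative, any bracketed line matches, but group 1 is only set for an Adblock header. Note that the review thread above ends with this regexp change being undone.

```python
import re

# Assumed form of the widened header pattern from this patch set.
HEADER = re.compile(r'\[(?:(Adblock(?:\s*Plus\s*[\d\.]+?)?)|.*)\]$', flags=re.I)


def header_version(text):
    """Return the captured version string, or None for a malformed header."""
    match = HEADER.match(text)
    # The pattern can match without capturing anything in group 1.
    return match.group(1) if match else None


print(header_version('[Adblock Plus 2.0]'))
print(header_version('[Something else]'))
```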

157	159

158	160

159 def _parse_instruction(text):	161 def _parse_instruction(text):

160 match = INCLUDE_REGEXP.match(text)	162 match = INCLUDE_REGEXP.match(text)

161 if not match:	163 if not match:

162 raise ParseError('Unrecognized instruction', text)	164 raise ParseError('Unrecognized instruction', text)

163 return Include(match.group(1))	165 return Include(match.group(1))

164	166

165	167

166 def _parse_option(option):	168 def _parse_option(option):

(...skipping 77 matching lines...)
244 Parsed filter.	246 Parsed filter.

245	247

246 """	248 """

247 if '#' in text:	249 if '#' in text:

248 match = HIDING_FILTER_REGEXP.search(text)	250 match = HIDING_FILTER_REGEXP.search(text)

249 if match:	251 if match:

250 return _parse_hiding_filter(text, *match.groups())	252 return _parse_hiding_filter(text, *match.groups())

251 return _parse_blocking_filter(text)	253 return _parse_blocking_filter(text)

252	254

253	255

254 def parse_line(line_text):	256 def parse_line(line_text, mode='body'):
	Sebastian Noack 2018/09/15 16:08:32
	Having the "mode" as part of the public API requires documenting it (you did that below), and probably also calls for more tests than we currently have. By not exposing this implementation detail, the public API (and its documentation and tests) can be simpler. Also we wouldn't need to validate the value below.

	Vasily Kuznetsov 2018/09/17 10:40:27
	I would like to keep the mode in the public API. This way the full functionality of line parsing is available to the users, and I think that's a good thing. You're right about the tests, however; I will add them.
255 """Parse one line of a filter list.	257 """Parse one line of a filter list.

256	258

257 Note that parse_line() doesn't handle special comments, hence never returns	259 The types of lines that the parser recognizes depend on the mode. In

258 a Metadata() object, Adblock Plus only considers metadata when parsing the	260 body mode the parser only recognizes filters, comments, processing

259 whole filter list and only if they are given at the top of the filter list.	261 instructions and empty lines. In metadata mode it additionally recognizes

	262 metadata. In start mode it also recognizes headers.

	263

	264 Note: checksum metadata lines are recognized in all modes for backwards

	265 compatibility. Historically, checksums can occur at the bottom of the

	266 filter list. They are no longer used by Adblock Plus, but in order to

	267 strip them (in abp.filters.renderer), we have to make sure to still parse

	268 them regardless of their position in the filter list.

260	269

261 Parameters	270 Parameters

262 ----------	271 ----------

263 line_text : str	272 line_text : str

264 Line of a filter list.	273 Line of a filter list.

	274 mode : str

	275 Parsing mode, one of "start", "metadata" or "body" (default).

265	276

266 Returns	277 Returns

267 -------	278 -------

268 namedtuple	279 namedtuple

269 Parsed line (see `_line_type`).	280 Parsed line (see `_line_type`).

270	281

271 Raises	282 Raises

272 ------	283 ------

273 ParseError	284 ParseError

274 ParseError: If the line can't be parsed.	285 ParseError: If the line can't be parsed.

	286

275 """	287 """

	288 MODES = {'body', 'start', 'metadata'}

	289 if mode not in MODES:

	290 raise ValueError('mode should be one of {}'.format(MODES))

	291

276 if isinstance(line_text, type(b'')):	292 if isinstance(line_text, type(b'')):

277 line_text = line_text.decode('utf-8')	293 line_text = line_text.decode('utf-8')

278	294

279 content = line_text.strip()	295 content = line_text.strip()
	Sebastian Noack 2018/09/15 16:08:32
	Adblock Plus doesn't strip the line before processing headers and metadata, i.e. a line with leading and/or trailing whitespace isn't considered a valid header, and trailing whitespace in metadata values is preserved.

	Vasily Kuznetsov 2018/09/17 10:40:27
	The behavior of ABP for the headers seems right. I will adjust the code here. However, I'm not so sure about preserving the trailing space. Do you think it is desirable? I mean, do you think ABP is doing the right thing in this case -- I agree that python-abp should behave the same.

	Sebastian Noack 2018/09/17 18:11:52
	Adblock Plus extracts metadata (and the header) before parsing the remaining filter list content (so does my patch for python-abp). The regular expression used to identify metadata and extract the key and value would be more complicated if we wanted to strip trailing whitespace from the value, and in practice it doesn't seem to matter whether trailing spaces are stripped.

	Vasily Kuznetsov 2018/09/18 12:41:14
	Acknowledged.
280	296

281 if content == '':	297 if content == '':

282 line = EmptyLine()	298 return EmptyLine()

283 elif content.startswith('!'):

284 line = Comment(content[1:].lstrip())

285 elif content.startswith('%') and content.endswith('%'):

286 line = _parse_instruction(content)

287 elif content.startswith('[') and content.endswith(']'):

288 line = _parse_header(content)

289 else:

290 line = parse_filter(content)

291	299

292 assert line.to_string().replace(' ', '') == content.replace(' ', '')	300 if content.startswith('!'):

293 return line	301 match = METADATA_REGEXP.match(line_text)

	302 if match:

	303 key, value = match.groups()

	304 if mode != 'body' or key.lower() == 'checksum':
	Sebastian Noack 2018/09/15 16:08:32
	We probably should keep the comment explaining why we treat checksums specially here.

	Vasily Kuznetsov 2018/09/17 10:40:27
	I would also like to keep the note about checksums in the docstring. Do you think it would be OK to just refer to the docstring here?

	Sebastian Noack 2018/09/17 18:11:52
	I didn't notice that you moved that note to the docstring. Personally, I would rather have kept it here, but fair enough.

	Vasily Kuznetsov 2018/09/18 12:41:14
	It needs to be in the docstring because it's part of the external behavior / API. I agree that mentioning it here is helpful, so I'll do both.
	305 return Metadata(key, value)

	306 return Comment(content[1:].lstrip())

	307

	308 if content.startswith('%') and content.endswith('%'):

	309 return _parse_instruction(content)

	310

	311 if mode == 'start' and content.startswith('[') and content.endswith(']'):

	312 return _parse_header(content)

	313

	314 return parse_filter(content)
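
The mode-dependent branching on the right-hand side can be sketched in isolation as follows. This is a simplified stand-in, not the module's code: `classify` and `METADATA` are hypothetical names, the metadata pattern is an assumed reading of METADATA_REGEXP, and plain `(kind, payload)` tuples replace the namedtuples and the `_parse_*` helpers.

```python
import re

# Assumed reading of the new METADATA_REGEXP (hypothetical name).
METADATA = re.compile(r'\s*!\s*(.*?)\s*:\s*(.*)')


def classify(line_text, mode='body'):
    """Return a (kind, payload) pair; a simplified stand-in for parse_line()."""
    if mode not in {'body', 'start', 'metadata'}:
        raise ValueError('unknown mode: {!r}'.format(mode))

    content = line_text.strip()
    if content == '':
        return ('emptyline', None)

    if content.startswith('!'):
        match = METADATA.match(line_text)
        if match:
            key, value = match.groups()
            # Checksums are recognized in every mode so they can be stripped.
            if mode != 'body' or key.lower() == 'checksum':
                return ('metadata', (key, value))
        return ('comment', content[1:].lstrip())

    if content.startswith('%') and content.endswith('%'):
        return ('include', content)

    # Headers are only recognized when parsing the first line.
    if mode == 'start' and content.startswith('[') and content.endswith(']'):
        return ('header', content[1:-1])

    return ('filter', content)
```

For example, `classify('[Adblock Plus 2.0]', 'start')` yields a header, while the same line in the default body mode falls through to the filter branch, which is the behavior the issue title asks for.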

294	315

295	316

296 def parse_filterlist(lines):	317 def parse_filterlist(lines):

297 """Parse filter list from an iterable.	318 """Parse filter list from an iterable.

298	319

299 Parameters	320 Parameters

300 ----------	321 ----------

301 lines: iterable of str	322 lines: iterable of str

302 Lines of the filter list.	323 Lines of the filter list.

303	324

304 Returns	325 Returns

305 -------	326 -------

306 iterator of namedtuple	327 iterator of namedtuple

307 Parsed lines of the filter list.	328 Parsed lines of the filter list.

308	329

309 Raises	330 Raises

310 ------	331 ------

311 ParseError	332 ParseError

312 Thrown during iteration for invalid filter list lines.	333 Thrown during iteration for invalid filter list lines.

313 TypeError	334 TypeError

314 If `lines` is not iterable.	335 If `lines` is not iterable.

315	336

316 """	337 """

317 metadata_closed = False	338 mode = 'start'
	Sebastian Noack 2018/09/17 18:11:52
	Maybe "position" would be more accurate name for this variable? Also instead of "start" maybe "header" would be more in line with the other values.

	Vasily Kuznetsov 2018/09/18 12:41:14
	Yeah, "position" is a better name. I changed it. As for "start" vs. "header" -- I have chosen the former because there's not always a header in the file, so giving the impression that we're only looking for a header seems misleading.
318	339

319 for line in lines:	340 for line in lines:

320 result = parse_line(line)	341 parsed_line = parse_line(line, mode)

	342 yield parsed_line

321	343

322 if result.type == 'comment':	344 if mode != 'body' and parsed_line.type in {'header', 'metadata'}:

323 match = METADATA_REGEXP.match(result.text)	345 # Continue parsing metadata if it's not over...

324 if match:	346 mode = 'metadata'

325 key, value = match.groups()	347 else:

326	348 # ...otherwise switch to parsing filter list body.

327 # Historically, checksums can occur at the bottom of the	349 mode = 'body'

328 # filter list. Checksums are no longer used by Adblock Plus,

329 # but in order to strip them (in abp.filters.renderer),

330 # we have to make sure to still parse them regardless of

331 # their position in the filter list.

332 if not metadata_closed or key.lower() == 'checksum':

333 result = Metadata(key, value)

334

335 if result.type not in {'header', 'metadata'}:

336 metadata_closed = True

337

338 yield result
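
The mode transitions in the new loop can be traced with a small sketch. `positions` is a hypothetical helper that takes the kinds parse_line() is assumed to have returned for consecutive lines and yields the mode each line was parsed in. It shows that 'start' is only ever used for the first line, and that the first line which is neither a header nor metadata switches the parser to 'body' for good.

```python
def positions(kinds):
    """Yield the mode used for each line, given the kind each line parsed to."""
    mode = 'start'
    for kind in kinds:
        yield mode
        if mode != 'body' and kind in {'header', 'metadata'}:
            # Still in the metadata block at the top of the list.
            mode = 'metadata'
        else:
            # Anything else closes the metadata block permanently.
            mode = 'body'


# The late 'metadata' entry stands for a checksum line near the bottom.
trace = list(positions(['header', 'metadata', 'comment', 'metadata', 'filter']))
print(trace)
```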
