OLD | NEW |
1 # python-abp | 1 # python-abp |
2 | 2 |
3 This repository contains the script that is used for building Adblock Plus | 3 This repository contains a library for working with Adblock Plus filter lists |
4 filter lists from the form in which they are authored into the format suitable | 4 and the script that is used for building Adblock Plus filter lists from the |
5 for consumption by the adblocking software. | 5 form in which they are authored into the format suitable for consumption by the |
| 6 adblocking software. |
6 | 7 |
7 ## Installation | 8 ## Installation |
8 | 9 |
9 Prerequisites: | 10 Prerequisites: |
10 | 11 |
11 * Linux, Mac OS X or Windows (any modern Unix should work too), | 12 * Linux, Mac OS X or Windows (any modern Unix should work too), |
12 * Python (2.7 or 3.5), | 13 * Python (2.7, 3.5 or 3.6), |
13 * pip. | 14 * pip. |
14 | 15 |
15 To install: | 16 To install: |
16 | 17 |
17 $ pip install -U python-abp | 18 $ pip install -U python-abp |
18 | 19 |
19 ## Rendering of filter lists | 20 ## Rendering of filter lists |
20 | 21 |
21 The filter lists are originally authored in relatively small parts focused | 22 The filter lists are originally authored in relatively small parts focused |
22 on a particular type of filters, related to a specific topic or relevant | 23 on a particular type of filters, related to a specific topic or relevant |
23 for a particular geographical area. | 24 for a particular geographical area. |
24 We call these parts _filter list fragments_ (or just _fragments_) | 25 We call these parts _filter list fragments_ (or just _fragments_) |
25 to distinguish them from full filter lists that are | 26 to distinguish them from full filter lists that are |
26 consumed by the adblocking software such as Adblock Plus. | 27 consumed by the adblocking software such as Adblock Plus. |
27 | 28 |
28 Rendering is a process that combines filter list fragments into a filter list. | 29 Rendering is a process that combines filter list fragments into a filter list. |
29 It starts with one fragment that can include other ones and so forth. | 30 It starts with one fragment that can include other ones and so forth. |
30 The produced filter list is marked with a version, a timestamp and | 31 The produced filter list is marked with a version, a timestamp and |
31 a [checksum](https://adblockplus.org/filters#special-comments). | 32 a [checksum][1]. |
32 | 33 |
33 Python-abp contains a script that can do this called `flrender`: | 34 Python-abp contains a script that can do this called `flrender`: |
34 | 35 |
35 $ flrender fragment.txt filterlist.txt | 36 $ flrender fragment.txt filterlist.txt |
36 | 37 |
37 This will take the top level fragment in `fragment.txt`, render it and save into | 38 This will take the top level fragment in `fragment.txt`, render it and save into |
38 `filterlist.txt`. | 39 `filterlist.txt`. |
39 | 40 |
40 Fragments might reference other fragments that should be included into them. | 41 Fragments might reference other fragments that should be included into them. |
41 The references come in two forms: http(s) includes and local includes: | 42 The references come in two forms: http(s) includes and local includes: |
42 | 43 |
43 %include http://www.server.org/dir/list.txt% | 44 %include http://www.server.org/dir/list.txt% |
44 %include easylist:easylist/easylist_general_block.txt | 45 %include easylist:easylist/easylist_general_block.txt% |
45 | 46 |
46 The first instruction contains a URL that will be fetched and inserted at the | 47 The first instruction contains a URL that will be fetched and inserted at the |
47 point of reference. | 48 point of reference. |
48 The second one contains a path inside the easylist repository. | 49 The second one contains a path inside the easylist repository. |
49 `flrender` needs to be able to find a copy of the repository on the local | 50 `flrender` needs to be able to find a copy of the repository on the local |
50 filesystem. We use the `-i` option to point it to the right directory: | 51 filesystem. We use the `-i` option to point it to the right directory: |
51 | 52 |
52 $ flrender -i easylist=/home/abc/easylist input.txt output.txt | 53 $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
53 | 54 |
54 Now the second reference above will be resolved to | 55 Now the second reference above will be resolved to |
(...skipping 13 matching lines...) |
68 If you don't know all the source names that are needed to render some list, | 69 If you don't know all the source names that are needed to render some list, |
69 just run `flrender` and it will report what it's missing: | 70 just run `flrender` and it will report what it's missing: |
70 | 71 |
71 $ flrender easylist.txt output/easylist.txt | 72 $ flrender easylist.txt output/easylist.txt |
72 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener | 73 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
73 al_block.txt' from 'easylist.txt' | 74 al_block.txt' from 'easylist.txt' |
74 | 75 |
75 You can clone the necessary repositories to a local directory and add `-i` | 76 You can clone the necessary repositories to a local directory and add `-i` |
76 options accordingly. | 77 options accordingly. |
77 | 78 |
| 79 ## Library API |
| 80 |
| 81 Python-abp can also be used as a library for parsing filter lists. For example, |
| 82 to read a filter list (we use Python 3 syntax here but the API is the same): |
| 83 |
| 84 from abp.filters import parse_filterlist |
| 85 |
| 86 with open('filterlist.txt') as filterlist: |
| 87     for line in parse_filterlist(filterlist): |
| 88         print(line) |
| 89 |
| 90 If `filterlist.txt` contains a filter list, the output will look similar to |
| 91 the following: |
| 92 |
| 93 Header(version='Adblock Plus 2.0') |
| 94 Metadata(key='Title', value='Example List') |
| 95 EmptyLine() |
| 96 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': |
| 97 'div#ad1'}, action='hide', options={'domains-include': ['abc.com', |
| 98 'cdf.com'], 'domains-none': True}) |
| 99 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': |
| 100 'abc.com/ad'}, action='block', options={'types-none': True, |
| 101 'types-include': ['image']}) |
| 102 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': |
| 103 'abc\\.com'}, action='allow', options={}) |
| 104 ... |
| 105 |
| 106 In general `parse_filterlist` takes an iterable of strings (such as a list or |
| 107 an open file) and returns an iterable of parsed filter list lines. Each line |
| 108 will have its `.type` attribute set to a string indicating its type. It will |
| 109 also have a `.to_string()` method that converts it to a unicode string in the |
| 110 filter list format (most of the time it's the same as the string from which the |
| 111 filter was parsed). Further attributes depend on the type of the line. |
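
The `.type` attribute described above can be used to pick out only the lines of interest. The following is a minimal sketch, not part of python-abp: `filter_texts` and `make_line` are hypothetical helpers, and the demo objects are stand-ins that only mimic the `.type` attribute and `.to_string()` method so the example runs without the library installed.

```python
from types import SimpleNamespace


def filter_texts(parsed_lines):
    """Return the filter list text of every line whose .type is 'filter'."""
    return [line.to_string() for line in parsed_lines if line.type == 'filter']


def make_line(line_type, text):
    # Hypothetical stand-in for a parsed line object: it only provides
    # the .type attribute and the .to_string() method described above.
    return SimpleNamespace(type=line_type, to_string=lambda: text)


demo_lines = [
    make_line('header', '[Adblock Plus 2.0]'),
    make_line('comment', '! Example'),
    make_line('filter', 'abc.com/ad$image'),
]

only_filters = filter_texts(demo_lines)
print(only_filters)
```

With the real library, the same function would presumably be called on the result of `parse_filterlist`.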
| 112 |
| 113 **Note:** `parse_filterlist` returns an iterator, not a list, and only consumes |
| 114 the input lines when its output is iterated over. This allows much more memory |
| 115 efficient handling of large filter lists, however there are two things to watch |
| 116 out for: |
| 117 |
| 118 - When you're parsing filters from a file, you need to complete the iteration |
| 119 before you close the file. |
| 120 - After you iterate over the output of `parse_filterlist` once, it will be |
| 121 consumed and you won't be able to iterate over it again. |
| 122 |
| 123 If you find that any of these issues is bothering you, you probably want to |
| 124 convert the output of `parse_filterlist` to a list: |
| 125 |
| 126 lines_list = list(parse_filterlist(filterlist)) |
| 127 |
| 128 This will load the whole file into memory but unless you're dealing with a |
| 129 gigantic filter list that should not be a problem. |
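
Both pitfalls and the `list()` workaround can be reproduced with any lazy generator. The sketch below uses `lazy_parse`, a hypothetical stand-in that merely strips each line, in place of `parse_filterlist`, and an in-memory `io.StringIO` in place of a real file, so it runs without python-abp installed.

```python
import io

TEXT = "[Adblock Plus 2.0]\n! Title: Example List\nabc.com/ad$image\n"


def lazy_parse(lines):
    """Hypothetical stand-in for parse_filterlist: one item per line, lazily."""
    for line in lines:
        yield line.strip()


# Pitfall 1: the input is closed before the iteration happens.
with io.StringIO(TEXT) as source:
    parsed = lazy_parse(source)  # nothing has been read yet
try:
    list(parsed)
    failed_after_close = False
except ValueError:  # raised when reading from a closed file object
    failed_after_close = True

# Pitfall 2: the output can only be iterated over once.
parsed = lazy_parse(io.StringIO(TEXT))
first_pass = list(parsed)
second_pass = list(parsed)  # already consumed, so this is empty

# Workaround: convert to a list while the input is still open.
with io.StringIO(TEXT) as source:
    lines_list = list(lazy_parse(source))

print(failed_after_close, len(first_pass), len(second_pass), len(lines_list))
```

The list can be iterated over as many times as needed, at the cost of keeping all lines in memory.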
| 130 |
| 131 ### Line types |
| 132 |
| 133 As mentioned before, lines of different types have different attributes: |
| 134 |
| 135 | type | attributes | |
| 136 |------------|------------------------------------------------------------------------| |
| 137 | header | `version` - plugin version string | |
| 138 | emptyline | no options | |
| 139 | comment | `text` - text of the comment | |
| 140 | metadata | `key` - name of the metadata field, `value` - value of the field | |
| 141 | include | `target` - url/path of the file to include | |
| 142 | invalid | `text` - full text of the line, `error` - error message | |
| 143 | filter | `text` - text of the filter, `selector` - what to look for, `action` - what to do with selected items, `options` - filter options | |
| 144 |
| 145 #### Filter attributes |
| 146 |
| 147 Selector is a dictionary with two keys: |
| 148 |
| 149 | key | meaning | |
| 150 |--------------|----------------------------------------------------| |
| 151 | type | 'css', 'abp-simple', 'url-pattern', 'url-regexp' | |
| 152 | value | the selector itself, the meaning is type-dependent | |
| 153 |
| 154 Options is a dictionary with a variable set of keys. Only options that are |
| 155 actually present in the filter will be stored there. The list of possible options |
| 156 and their meanings can be found in [documentation on authoring the filter |
| 157 rules][2]. |
| 158 |
| 159 There are four classes of options that are handled differently: |
| 160 |
| 161 - Type options (that make the rule apply or not apply to certain types of |
| 162   requests and resources): |
| 163     - `types-include`: List of additional types to which the rule applies. |
| 164     - `types-exclude`: List of types to which the rule doesn't apply. |
| 165     - `types-none`: If this is `True`, the filter only applies to the types |
| 166       in `types-include`. Otherwise all types except for `document`, `popup`, |
| 167       `elemhide`, `generichide` and `genericblock` are implicitly included. |
| 168 - Domain options (that make the rule apply or not apply to specific domains): |
| 169     - `domains-include`: List of domains to which the rule applies (it will also |
| 170       apply to any subdomains unless they are excluded). |
| 171     - `domains-exclude`: Excluded domains (their subdomains are also excluded |
| 172       unless specifically included). |
| 173     - `domains-none`: If this is `True`, all domains that are not mentioned by |
| 174       `domains-include` and `domains-exclude` are excluded. Otherwise they are |
| 175       included. |
| 176 - `sitekeys`: List of sitekeys that can be used to activate the rule. |
| 177 - Flags: `third-party`, `collapse`, `match-case`, etc. See [documentation][2] |
| 178   for more information on their meaning. |
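
One way to read the type and domain options is sketched below. This is an interpretation of the descriptions above, not code from python-abp: `applies_to_type` and `applies_to_domain` are hypothetical helpers that operate on a plain `options` dictionary. Two points are assumptions that the text does not settle: an exclusion is checked before an inclusion when a type appears in both lists, and for domains the most specific mentioned domain wins.

```python
# Types that are not implicitly included when `types-none` is not set.
NOT_IMPLICIT_TYPES = {'document', 'popup', 'elemhide', 'generichide',
                      'genericblock'}


def applies_to_type(options, request_type):
    """Decide whether a filter with these options applies to request_type."""
    if request_type in options.get('types-exclude', []):
        return False  # assumption: exclusion is checked first
    if request_type in options.get('types-include', []):
        return True
    if options.get('types-none'):
        return False  # only the types in `types-include` apply
    return request_type not in NOT_IMPLICIT_TYPES


def applies_to_domain(options, domain):
    """Decide whether a filter with these options applies to domain."""
    included = set(options.get('domains-include', []))
    excluded = set(options.get('domains-exclude', []))
    labels = domain.split('.')
    # Walk from the full domain up through its parent domains so that
    # the most specific mentioned domain decides (an assumption).
    for start in range(len(labels)):
        candidate = '.'.join(labels[start:])
        if candidate in included:
            return True
        if candidate in excluded:
            return False
    # Domains that are not mentioned at all fall back to `domains-none`.
    return not options.get('domains-none')


# Option dictionaries copied from the example output earlier in this README.
hide_options = {'domains-include': ['abc.com', 'cdf.com'], 'domains-none': True}
block_options = {'types-none': True, 'types-include': ['image']}

print(applies_to_domain(hide_options, 'www.abc.com'))
print(applies_to_domain(hide_options, 'example.org'))
print(applies_to_type(block_options, 'image'))
print(applies_to_type(block_options, 'script'))
```

Under this reading, a filter with empty options applies to every domain and to every type except the five listed in `NOT_IMPLICIT_TYPES`.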
| 179 |
| 180 ### Other functions |
| 181 |
| 182 The `abp.filters` module also exports two lower-level functions for parsing |
| 183 individual lines of a filter list or individual filters. Not very surprisingly, |
| 184 they are called `parse_line` and `parse_filter` respectively. Both will return |
| 185 a parsed line object just like the items in the iterator returned by |
| 186 `parse_filterlist`. The difference between them is that `parse_line` tries to |
| 187 do line type detection and `parse_filter` will always try to interpret things |
| 188 as a filter. Both functions will throw a `ParseError` exception instead of |
| 189 returning a line with `type="invalid"`. |
| 190 |
78 ## Testing | 191 ## Testing |
79 | 192 |
80 Unit tests for `python-abp` are located in the `/tests` directory. | 193 Unit tests for `python-abp` are located in the `/tests` directory. |
81 [Pytest](http://pytest.org/) is used for quickly running the tests | 194 [Pytest][3] is used for quickly running the tests |
82 during development. | 195 during development. |
83 [Tox](https://tox.readthedocs.org/) is used for testing in different | 196 [Tox][4] is used for testing in different |
84 environments (Python 2.7, Python 3.5 and PyPy) and code quality | 197 environments (Python 2.7, 3.5, 3.6 and PyPy) and code quality |
85 reporting. | 198 reporting. |
86 | 199 |
87 In order to execute the tests, first create and activate a development | 200 In order to execute the tests, first create and activate a development |
88 virtualenv: | 201 virtualenv: |
89 | 202 |
90 $ python setup.py devenv | 203 $ python setup.py devenv |
91 $ . devenv/bin/activate | 204 $ . devenv/bin/activate |
92 | 205 |
93 With the development virtualenv activated use pytest for a quick test run: | 206 With the development virtualenv activated use pytest for a quick test run: |
94 | 207 |
95 (devenv) $ py.test tests | 208 (devenv) $ pytest tests |
96 | 209 |
97 and tox for a comprehensive report: | 210 and tox for a comprehensive report: |
98 | 211 |
99 (devenv) $ tox | 212 (devenv) $ tox |
| 213 |
| 214 |
| 215 [1]: https://adblockplus.org/filters#special-comments |
| 216 [2]: https://adblockplus.org/filters#options |
| 217 [3]: http://pytest.org/ |
| 218 [4]: https://tox.readthedocs.org/ |