LEFT | RIGHT |
1 # python-abp | 1 # python-abp |
2 | 2 |
3 This repository contains a library for working with Adblock Plus filter lists | 3 This repository contains a library for working with Adblock Plus filter lists |
4 and the script that is used for building Adblock Plus filter lists from the | 4 and the script that is used for building Adblock Plus filter lists from the |
5 form in which they are authored into the format suitable for consumption by the | 5 form in which they are authored into the format suitable for consumption by the |
6 adblocking software. | 6 adblocking software. |
7 | 7 |
8 ## Installation | 8 ## Installation |
9 | 9 |
10 Prerequisites: | 10 Prerequisites: |
11 | 11 |
12 * Linux, Mac OS X or Windows (any modern Unix should work too), | 12 * Linux, Mac OS X or Windows (any modern Unix should work too), |
13 * Python (2.7 or 3.5, 3.6), | 13 * Python (2.7 or 3.5+), |
14 * pip. | 14 * pip. |
15 | 15 |
16 To install: | 16 To install: |
17 | 17 |
18 $ pip install -U python-abp | 18 $ pip install -U python-abp |
19 | 19 |
20 ## Rendering of filter lists | 20 ## Rendering of filter lists |
21 | 21 |
22 The filter lists are originally authored in relatively smaller parts focused | 22 The filter lists are originally authored in relatively smaller parts focused |
23 on a particular type of filters, related to a specific topic or relevant | 23 on a particular type of filters, related to a specific topic or relevant |
(...skipping 23 matching lines...)
47 The first instruction contains a URL that will be fetched and inserted at the | 47 The first instruction contains a URL that will be fetched and inserted at the |
48 point of reference. | 48 point of reference. |
49 The second one contains a path inside easylist repository. | 49 The second one contains a path inside easylist repository. |
50 `flrender` needs to be able to find a copy of the repository on the local | 50 `flrender` needs to be able to find a copy of the repository on the local |
51 filesystem. We use `-i` option to point it to the right directory: | 51 filesystem. We use `-i` option to point it to the right directory: |
52 | 52 |
53 $ flrender -i easylist=/home/abc/easylist input.txt output.txt | 53 $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
54 | 54 |
55 Now the second reference above will be resolved to | 55 Now the second reference above will be resolved to |
56 `/home/abc/easylist/easylist/easylist_general_block.txt` and the fragment will | 56 `/home/abc/easylist/easylist/easylist_general_block.txt` and the fragment will |
57 be read from this file. | 57 be loaded from this file. |
58 | 58 |
59 Directories that contain filter list fragments that are used during rendering | 59 Directories that contain filter list fragments that are used during rendering |
60 are called sources. | 60 are called sources. |
61 They are normally working copies of the repositories that contain filter list | 61 They are normally working copies of the repositories that contain filter list |
62 fragments. | 62 fragments. |
63 Each source is identified by a name: that's the part that comes before ":" | 63 Each source is identified by a name: that's the part that comes before ":" |
64 in the include instruction and it should be the same as what comes before "=" | 64 in the include instruction and it should be the same as what comes before "=" |
65 in the `-i` option. | 65 in the `-i` option. |
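The relationship between the source name in an include instruction and the `-i` option can be sketched in a few lines of Python. This is only an illustration of the naming rule described above, not `flrender`'s actual implementation: `resolve_include` and the `sources` dictionary are hypothetical names, and URL includes are deliberately left out of scope.

```python
import posixpath


def resolve_include(reference, sources):
    """Resolve a 'name:path' include reference to a filesystem path.

    Hypothetical helper for illustration only; not part of python-abp.
    """
    # The part before ":" is the source name, the rest is a path inside it.
    name, sep, path = reference.partition(':')
    if not sep:
        raise ValueError("Expected 'name:path', got {!r}".format(reference))
    if name not in sources:
        raise KeyError("Unknown source: {!r}".format(name))
    return posixpath.join(sources[name], path)


# Equivalent of passing `-i easylist=/home/abc/easylist` on the command line.
sources = {'easylist': '/home/abc/easylist'}
resolved = resolve_include('easylist:easylist/easylist_general_block.txt',
                           sources)
print(resolved)
```

With the mapping above, the reference resolves to `/home/abc/easylist/easylist/easylist_general_block.txt`, and a reference to a name that is missing from `sources` fails, which mirrors the "Unknown source" report.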
66 | 66 |
67 Commonly used sources have generally accepted names. For example the main | 67 Commonly used sources have generally accepted names. For example the main |
68 EasyList repository is referred to as `easylist`. | 68 EasyList repository is referred to as `easylist`. |
69 If you don't know all the source names that are needed to render some list, | 69 If you don't know all the source names that are needed to render some list, |
70 just run `flrender` and it will report what it's missing: | 70 just run `flrender` and it will report what it's missing: |
71 | 71 |
72 $ flrender easylist.txt output/easylist.txt | 72 $ flrender easylist.txt output/easylist.txt |
73 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener | 73 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
74 al_block.txt' from 'easylist.txt' | 74 al_block.txt' from 'easylist.txt' |
75 | 75 |
76 You can clone the necessary repositories to a local directory and add `-i` | 76 You can clone the necessary repositories to a local directory and add `-i` |
77 options accordingly. | 77 options accordingly. |
78 | 78 |
79 ## Library API | 79 ## Library API |
80 | 80 |
81 Python-abp can also be used as a library for parsing filter lists. For example | 81 Python-abp can also be used as a library for parsing filter lists. For example |
82 to read a filter list (we use Python 3 syntax here but the API is the same): | 82 to read a filter list (we use Python 3 syntax here but the API is the same): |
83 | 83 |
84 from abp.filter import parse_filterlist | 84 from abp.filters import parse_filterlist |
85 | 85 |
86 with open('filterlist.txt') as filterlist: | 86 with open('filterlist.txt') as filterlist: |
87 for line in parse_filterlist(filterlist): | 87 for line in parse_filterlist(filterlist): |
88 print(line) | 88 print(line) |
89 | 89 |
90 If `filterlist.txt` contains a filter list, the output will look similar to | 90 If `filterlist.txt` contains a filter list: |
91 the following: | 91 |
| 92 [Adblock Plus 2.0] |
| 93 ! Title: Example list |
| 94 |
| 95 abc.com,cdf.com##div#ad1 |
| 96 abc.com/ad$image |
| 97 @@/abc\.com/ |
| 98 ... |
| 99 |
| 100 the output will look something like: |
92 | 101 |
93 Header(version='Adblock Plus 2.0') | 102 Header(version='Adblock Plus 2.0') |
94 Metadata(key='Title', value='Example List') | 103 Metadata(key='Title', value='Example list') |
95 EmptyLine() | 104 EmptyLine() |
96 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': | 105 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) |
97 'div#ad1'}, action='hide', options={'domains-include': ['abc.com', | 106 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) |
98 'cdf.com'], 'domains-none': True}) | 107 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) |
99 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': | |
100 'abc.com/ad'}, action='block', options={'types-none': True, | |
101 'types-include': ['image']}) | |
102 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': | |
103 'abc\\.com'}, action='allow', options={}) | |
104 ... | 108 ... |
105 | 109 |
106 In general `parse_filterlist` takes an iterable of strings (such as a list or | 110 `abp.filters` module also exports a lower-level function for parsing individual |
107 an open file) and returns an iterable of parsed filter list lines. Each line | 111 lines of a filter list: `parse_line`. It returns a parsed line object just like |
108 will have its `.type` attribute set to a string indicating its type. It will | 112 the items in the iterator returned by `parse_filterlist`. |
109 also have a `.to_string()` method that converts it to a unicode string in the | |
110 filter list format (most of the time it's the same as the string from which the | |
111 filter was parsed). Further attributes depend on the type of the line. | |
112 | 113 |
113 **Note:** `parse_filterlist` returns an iterator, not a list, and only consumes | 114 For further information on the library API use `help()` on `abp.filters` and |
114 the input lines when its output is iterated over. This allows much more memory | 115 its contents in interactive Python session, read the docstrings or look at the |
115 efficient handling of large filter lists, however there are two things to watch | 116 tests for some usage examples. |
116 out for: | |
117 | |
118 - When you're parsing filters from a file, you need to complete the iteration | |
119 before you close the file. | |
120 - Once you iterate over the output of `parse_filterlist` once, it will be | |
121 consumed and you won't be able to iterate over it again. | |
122 | |
123 If you find that any of these issues is bothering you, you probably want to | |
124 convert the output of `parse_filterlist` to a list: | |
125 | |
126 lines_list = list(parse_filterlist(filterlist)) | |
127 | |
128 This will load the whole file into memory but unless you're dealing with a | |
129 gigantic filter list that should not be a problem. | |
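The two pitfalls in the note above are general properties of lazy iterators and can be reproduced without python-abp. The sketch below uses a hypothetical generator, `parse_lines`, as a stand-in for a lazy parser (it is not `parse_filterlist` and does no real parsing), and an in-memory `io.StringIO` in place of a file.

```python
import io


def parse_lines(lines):
    """Hypothetical stand-in for a lazy parser: yields one item per line."""
    for line in lines:
        yield line.strip()


text = '[Adblock Plus 2.0]\n! Title: Example list\nabc.com/ad$image\n'

# Pitfall 1: the iterator is exhausted after a single pass.
lazy = parse_lines(io.StringIO(text))
first_pass = list(lazy)
second_pass = list(lazy)  # nothing is left to yield

# Pitfall 2: nothing is read until iteration starts, so closing the input
# before iterating makes the read fail.
stream = io.StringIO(text)
lazy = parse_lines(stream)
stream.close()
try:
    list(lazy)
    closed_error = False
except ValueError:  # raised for I/O on a closed stream
    closed_error = True

# Converting to a list while the input is still open avoids both problems.
with io.StringIO(text) as stream:
    lines_list = list(parse_lines(stream))

print(len(first_pass), len(second_pass), closed_error, len(lines_list))
```

The trade-off is the one already stated: the list holds every parsed line in memory at once, in exchange for being reusable after the input is closed.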
130 | |
131 ### Line types | |
132 | |
133 As mentioned before, lines of different types have different attributes: | |
134 | |
135 | type | attributes | | |
136 |------------|------------------------------------------------------------------------| | |
137 | header | `version` - plugin version string | | |
138 | emptyline | no options | | |
139 | comment | `text` - text of the comment | | |
140 | metadata | `key` - name of the metadata field, `value` - value of the field | | |
141 | include | `target` - url/path of the file to include | | |
142 | invalid | `text` - full text of the line, `error` - error message | | |
143 | filter | `text` - text of the filter, `selector` - what to look for, `action` - what to do with selected items, `options` - filter options | | |
144 | |
145 #### Filter attributes | |
146 | |
147 Selector is a dictionary with two keys: | |
148 | |
149 | key | meaning | | |
150 |--------------|----------------------------------------------------| | |
151 | type | 'css', 'abp-simple', 'url-pattern', 'url-regexp' | | |
152 | value | the selector itself, the meaning is type-dependent | | |
153 | |
154 Options is a dictionary with a variable set of keys. Only options that are | |
155 actually present in the filter will be stored there. The list of possible options | |
156 and their meanings can be found in [documentation on authoring the filter | |
157 rules][2]. | |
158 | |
159 There are four classes of options that are handled differently: | |
160 | |
161 - Type options (that make the rule apply or not apply to certain types of | |
162 requests and resources): | |
163 - `types-include`: List of additional types to which the rule applies. | |
164 - `types-exclude`: List of types to which the rule doesn't apply. | |
165 - `types-none`: If this is `True`, the filter only applies to the types | |
166 in `types-include`. Otherwise all types except for `document`, `popup`, | |
167 `elemhide`, `generichide` and `genericblock` are implicitly included. | |
168 - Domain options (that make the rule apply or not apply to specific domains): | |
169 - `domains-include`: List of domains to which the rule applies (it will also | |
170 apply to any subdomains unless they are excluded). | |
171 - `domains-exclude`: Excluded domains (their subdomains are also excluded | |
172 unless specifically included). | |
173 - `domains-none`: If this is `True`, all domains that are not mentioned by | |
174 `domains-include` and `domains-exclude` are excluded. Otherwise they are | |
175 included. | |
176 - `sitekeys`: List of sitekeys that can be used to activate the rule. | |
177 - Flags: `third-party`, `collapse`, `match-case`, etc. See [documentation][2] | |
178 for more information on their meaning. | |
179 | |
180 ### Other functions | |
181 | |
182 `abp.filters` module also exports two lower-level functions for parsing | |
183 individual lines of filter list or individual filters. Not very surprisingly | |
184 they are called `parse_line` and `parse_filter` respectively. Both will return | |
185 a parsed line object just like the items in the iterator returned by | |
186 `parse_filterlist`. The difference between them is that `parse_line` tries to | |
187 do line type detection and `parse_filter` will always try to interpret things | |
188 as a filter. Both functions will throw a `ParseError` exception instead of | |
189 returning a line with `type="invalid"`. | |
190 | 117 |
191 ## Testing | 118 ## Testing |
192 | 119 |
193 Unit tests for `python-abp` are located in the `/tests` directory. | 120 Unit tests for `python-abp` are located in the `/tests` directory. |
194 [Pytest][3] is used for quickly running the tests | 121 [Pytest][3] is used for quickly running the tests |
195 during development. | 122 during development. |
196 [Tox][4] is used for testing in different | 123 [Tox][4] is used for testing in different |
197 environments (Python 2.7, 3.5, 3.6 and PyPy) and code quality | 124 environments (Python 2.7, Python 3.5+ and PyPy) and code quality |
198 reporting. | 125 reporting. |
199 | 126 |
200 In order to execute the tests, first create and activate development | 127 In order to execute the tests, first create and activate development |
201 virtualenv: | 128 virtualenv: |
202 | 129 |
203 $ python setup.py devenv | 130 $ python setup.py devenv |
204 $ . devenv/bin/activate | 131 $ . devenv/bin/activate |
205 | 132 |
206 With the development virtualenv activated use pytest for a quick test run: | 133 With the development virtualenv activated use pytest for a quick test run: |
207 | 134 |
208 (devenv) $ pytest tests | 135 (devenv) $ pytest tests |
209 | 136 |
210 and tox for a comprehensive report: | 137 and tox for a comprehensive report: |
211 | 138 |
212 (devenv) $ tox | 139 (devenv) $ tox |
213 | 140 |
| 141 ## Development |
| 142 |
| 143 When adding new functionality, add tests for it (preferably first). Code |
| 144 coverage (as measured by `tox -e qa`) should not decrease and the tests |
| 145 should pass in all Tox environments. |
| 146 |
| 147 All public functions, classes and methods should have docstrings compliant with |
| 148 [NumPy/SciPy documentation guide][5]. One exception is the constructors of |
| 149 classes that the user is not expected to instantiate (such as exceptions). |
214 | 150 |
215 [1]: https://adblockplus.org/filters#special-comments | 151 [1]: https://adblockplus.org/filters#special-comments |
216 [2]: https://adblockplus.org/filters#options | 152 [2]: https://adblockplus.org/filters#options |
217 [3]: http://pytest.org/ | 153 [3]: http://pytest.org/ |
218 [4]: https://tox.readthedocs.org/ | 154 [4]: https://tox.readthedocs.org/ |
| 155 [5]: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt |