Index: README.md |
=================================================================== |
--- a/README.md |
+++ b/README.md |
@@ -1,20 +1,21 @@ |
# python-abp |
-This repository contains the script that is used for building Adblock Plus |
-filter lists from the form in which they are authored into the format suitable |
-for consumption by the adblocking software. |
+This repository contains a library for working with Adblock Plus filter lists |
+and the script that is used for building Adblock Plus filter lists from the |
+form in which they are authored into the format suitable for consumption by the |
+adblocking software. |
## Installation |
Prerequisites: |
* Linux, Mac OS X or Windows (any modern Unix should work too), |
-* Python (2.7 or 3.5), |
+* Python (2.7, 3.5 or 3.6), |
* pip. |
To install: |
$ pip install -U python-abp |
## Rendering of filter lists |
@@ -23,30 +24,30 @@ |
for particular geographical area. |
We call these parts _filter list fragments_ (or just _fragments_) |
to distinguish them from full filter lists that are |
consumed by the adblocking software such as Adblock Plus. |
Rendering is a process that combines filter list fragments into a filter list. |
It starts with one fragment that can include other ones and so forth. |
The produced filter list is marked with a version, a timestamp and |
-a [checksum](https://adblockplus.org/filters#special-comments). |
+a [checksum][1]. |
Python-abp contains a script that can do this called `flrender`: |
$ flrender fragment.txt filterlist.txt |
This will take the top level fragment in `fragment.txt`, render it and save into |
`filterlist.txt`. |
Fragments might reference other fragments that should be included into them. |
The references come in two forms: http(s) includes and local includes: |
%include http://www.server.org/dir/list.txt% |
- %include easylist:easylist/easylist_general_block.txt |
+ %include easylist:easylist/easylist_general_block.txt% |
The first instruction contains a URL that will be fetched and inserted at the |
point of reference. |
The second one contains a path inside easylist repository. |
`flrender` needs to be able to find a copy of the repository on the local |
filesystem. We use the `-i` option to point it to the right directory: |
$ flrender -i easylist=/home/abc/easylist input.txt output.txt |
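
The two include forms and the `-i name=directory` mapping can be illustrated with a short Python sketch. This is not the code `flrender` uses: `INCLUDE_RE` and `resolve_include` are hypothetical names introduced here, and the sketch only assumes the `%include ...%` syntax shown above.

```python
import os
import re

# Hypothetical pattern for the include instruction shown above.
INCLUDE_RE = re.compile(r'^\s*%include\s+(.+?)%\s*$')


def resolve_include(line, sources):
    """Classify an include line as ('web', url) or ('local', path).

    `sources` maps source names to directories, mirroring `-i name=dir`.
    Returns None if the line is not an include instruction.
    """
    match = INCLUDE_RE.match(line)
    if match is None:
        return None
    target = match.group(1).strip()
    if target.startswith(('http://', 'https://')):
        return ('web', target)
    # Local includes have the form "source_name:path/inside/source".
    name, sep, path = target.partition(':')
    if not sep:
        raise ValueError('Include target has no source name: ' + target)
    if name not in sources:
        raise KeyError("Unknown source: '{}'".format(name))
    return ('local', os.path.join(sources[name], path))
```

Calling `resolve_include` with `{'easylist': '/home/abc/easylist'}` as `sources` corresponds to passing `-i easylist=/home/abc/easylist` on the command line.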
@@ -70,30 +71,148 @@ |
$ flrender easylist.txt output/easylist.txt |
Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
al_block.txt' from 'easylist.txt' |
You can clone the necessary repositories to a local directory and add `-i` |
options accordingly. |
+## Library API |
+ |
+Python-abp can also be used as a library for parsing filter lists. For example, |
+to read a filter list (we use Python 3 syntax here but the API is the same): |
+ |
+    from abp.filters import parse_filterlist |
+ |
+    with open('filterlist.txt') as filterlist: |
+        for line in parse_filterlist(filterlist): |
+            print(line) |
+ |
+If `filterlist.txt` contains a filter list, the output will look similar to |
+the following: |
+ |
+ Header(version='Adblock Plus 2.0') |
+ Metadata(key='Title', value='Example List') |
+ EmptyLine() |
+ Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': |
+ 'div#ad1'}, action='hide', options={'domains-include': ['abc.com', |
+ 'cdf.com'], 'domains-none': True}) |
+ Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': |
+ 'abc.com/ad'}, action='block', options={'types-none': True, |
+ 'types-include': ['image']}) |
+ Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': |
+ 'abc\\.com'}, action='allow', options={}) |
+ ... |
+ |
+In general `parse_filterlist` takes an iterable of strings (such as a list or |
+an open file) and returns an iterable of parsed filter list lines. Each line |
+will have its `.type` attribute set to a string indicating its type. It will |
+also have a `.to_string()` method that converts it to a unicode string in the |
+filter list format (most of the time it's the same as the string from which the |
+filter was parsed). Further attributes depend on the type of the line. |
+ |
+**Note:** `parse_filterlist` returns an iterator, not a list, and only consumes |
+the input lines when its output is iterated over. This allows much more |
+memory-efficient handling of large filter lists, but there are two things to |
+watch out for: |
+ |
+- When you're parsing filters from a file, you need to complete the iteration |
+ before you close the file. |
+- Once you iterate over the output of `parse_filterlist`, it is consumed and |
+  you won't be able to iterate over it again. |
+ |
+If you find that any of these issues is bothering you, you probably want to |
+convert the output of `parse_filterlist` to a list: |
+ |
+ lines_list = list(parse_filterlist(filterlist)) |
+ |
+This will load the whole file into memory but unless you're dealing with a |
+gigantic filter list that should not be a problem. |
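
The single-pass behaviour described in the note can be reproduced without python-abp. The sketch below uses a stand-in generator, `fake_parse`, which is hypothetical and only mimics the lazy iteration; it does not parse anything.

```python
def fake_parse(lines):
    """Stand-in generator (NOT python-abp) that yields stripped lines lazily."""
    for line in lines:
        yield line.strip()


source = ['[Adblock Plus 2.0]\n', 'abc.com##div#ad1\n']

lazy = fake_parse(source)
first_pass = list(lazy)    # consumes the generator
second_pass = list(lazy)   # empty: the generator is already exhausted

# Materializing the output once gives a list that can be reused freely.
materialized = list(fake_parse(source))
again = list(materialized)
```

Here `second_pass` is empty while `again` holds the same items as `first_pass`, which is why converting to a list helps when the lines are needed more than once.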
+ |
+### Line types |
+ |
+As mentioned before, lines of different types have different attributes: |
+ |
+| type | attributes | |
+|------------|------------------------------------------------------------------------| |
+| header | `version` - plugin version string | |
+| emptyline | no attributes | |
+| comment | `text` - text of the comment | |
+| metadata | `key` - name of the metadata field, `value` - value of the field | |
+| include | `target` - url/path of the file to include | |
+| invalid | `text` - full text of the line, `error` - error message | |
+| filter | `text` - text of the filter, `selector` - what to look for, `action` - what to do with selected items, `options` - filter options | |
+ |
+#### Filter attributes |
+ |
+Selector is a dictionary with two keys: |
+ |
+| key | meaning | |
+|--------------|----------------------------------------------------| |
+| type | 'css', 'abp-simple', 'url-pattern', 'url-regexp' | |
+| value | the selector itself, the meaning is type-dependent | |
+ |
+Options is a dictionary with a variable set of keys. Only options that are |
+actually present in the filter will be stored there. The list of possible options |
+and their meanings can be found in [documentation on authoring the filter |
+rules][2]. |
+ |
+There are four classes of options that are handled differently: |
+ |
+- Type options (that make the rule apply or not apply to certain types of |
+ requests and resources): |
+ - `types-include`: List of additional types to which the rule applies. |
+ - `types-exclude`: List of types to which the rule doesn't apply. |
+ - `types-none`: If this is `True`, the filter only applies to the types |
+ in `types-include`. Otherwise all types except for `document`, `popup`, |
+ `elemhide`, `generichide` and `genericblock` are implicitly included. |
+- Domain options (that make the rule apply or not apply to specific domains): |
+ - `domains-include`: List of domains to which the rule applies (it will also |
+ apply to any subdomains unless they are excluded). |
+ - `domains-exclude`: Excluded domains (their subdomains are also excluded |
+ unless specifically included). |
+ - `domains-none`: If this is `True`, all domains that are not mentioned by |
+ `domains-include` and `domains-exclude` are excluded. Otherwise they are |
+ included. |
+- `sitekeys`: List of sitekeys that can be used to activate the rule. |
+- Flags: `third-party`, `collapse`, `match-case`, etc. See [documentation][2] |
+ for more information on their meaning. |
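
One possible reading of the type and domain options is sketched below in Python. The helpers `applies_to_type` and `applies_to_domain` are hypothetical and not part of python-abp. They assume that the most specific domain mentioned wins, and that an explicit `types-exclude` entry overrides `types-include`.

```python
# Types that are never implicitly included, per the list above.
NOT_IMPLICIT_TYPES = {'document', 'popup', 'elemhide',
                      'generichide', 'genericblock'}


def applies_to_type(options, request_type):
    """Decide whether a filter with these options covers `request_type`."""
    if request_type in options.get('types-exclude', []):
        return False
    if request_type in options.get('types-include', []):
        return True
    if options.get('types-none', False):
        return False
    return request_type not in NOT_IMPLICIT_TYPES


def applies_to_domain(options, domain):
    """Decide whether a filter with these options is active on `domain`."""
    included = set(options.get('domains-include', []))
    excluded = set(options.get('domains-exclude', []))
    labels = domain.lower().split('.')
    # Walk from the full domain up through its parents;
    # the closest mentioned domain decides (an assumption of this sketch).
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in included:
            return True
        if candidate in excluded:
            return False
    # Domains that are not mentioned at all fall back to `domains-none`.
    return not options.get('domains-none', False)
```

For the options of the first filter in the example output, `{'domains-include': ['abc.com', 'cdf.com'], 'domains-none': True}`, this sketch treats `www.abc.com` as covered and `example.org` as not covered.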
+ |
+### Other functions |
+ |
+The `abp.filters` module also exports two lower-level functions for parsing |
+individual lines of a filter list or individual filters. Not very surprisingly |
+they are called `parse_line` and `parse_filter` respectively. Both will return |
+a parsed line object just like the items in the iterator returned by |
+`parse_filterlist`. The difference between them is that `parse_line` tries to |
+do line type detection and `parse_filter` will always try to interpret things |
+as a filter. On invalid input both functions will raise a `ParseError` |
+exception instead of returning a line with `type="invalid"`. |
+ |
## Testing |
Unit tests for `python-abp` are located in the `/tests` directory. |
-[Pytest](http://pytest.org/) is used for quickly running the tests |
+[Pytest][3] is used for quickly running the tests |
during development. |
-[Tox](https://tox.readthedocs.org/) is used for testing in different |
-environments (Python 2.7, Python 3.5 and PyPy) and code quality |
+[Tox][4] is used for testing in different |
+environments (Python 2.7, 3.5, 3.6 and PyPy) and code quality |
reporting. |
In order to execute the tests, first create and activate development |
virtualenv: |
$ python setup.py devenv |
$ . devenv/bin/activate |
With the development virtualenv activated use pytest for a quick test run: |
- (devenv) $ py.test tests |
+ (devenv) $ pytest tests |
and tox for a comprehensive report: |
(devenv) $ tox |
+ |
+ |
+ [1]: https://adblockplus.org/filters#special-comments |
+ [2]: https://adblockplus.org/filters#options |
+ [3]: http://pytest.org/ |
+ [4]: https://tox.readthedocs.org/ |