In my previous post, I introduced the Scrapy framework and its use cases, and mentioned that it provides a broad variety of hooks and mechanisms to extend its functionality. My GSoC project is to ease the use of these mechanisms for both users and developers by providing an extra layer – the add-on system – to Scrapy’s extension management. Here, we take a look at Scrapy’s settings API, where extensions are configured, and introduce the first sub-project needed for the implementation of an add-on system.

Scrapy Settings

When you work with Scrapy, you typically organise your code in a Scrapy project. The project is a structured folder that holds your:

  • items (where you define the structure of the data you wish to extract),
  • spiders (where you parse web page content into items),
  • pipelines (where you filter and export items), and,
  • if you want, much more code (e.g. downloader middlewares if you want to mess with the requests Scrapy sends to web servers).

The project also contains a settings.py file. This file is the entry point to changing many of Scrapy’s internal settings. For example, you would change the user agent that is reported to web servers by defining the USER_AGENT setting, or throttle how fast Scrapy sends requests to the same server by setting a DOWNLOAD_DELAY.
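
For example, a minimal settings.py might look like this (the setting names are real Scrapy settings; the values are purely illustrative):

# In settings.py

USER_AGENT = 'mybot (+http://www.example.com)'  # reported to web servers
DOWNLOAD_DELAY = 2.0  # wait two seconds between requests to the same server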

Setting Priorities

The settings.py file is not the only place where settings are defined. Some setting defaults are overwritten when you call a specific scrapy command, and you can temporarily overwrite settings by passing command line arguments.

To honour the precedence of settings set at different locations – command line arguments take precedence over settings.py, which in turn takes precedence over the default settings – without having to care about the order in which settings are read in the code, Julia Medina introduced settings priorities in last year’s Summer of Code.

Internally, settings are saved in an instance of a Settings class. This class maps keys to values, much like a dictionary. However, the Settings.set() method that is used to write to it has a priority keyword argument. When a setting is saved, the given priority is saved along with the value. On the next call to set(), the setting is overwritten if, and only if, the given priority equals or exceeds the priority of the already saved value. This frees Scrapy’s codebase from having to watch out for order when overwriting settings. If a setting is given via the command line, it is simply saved with a high priority and is then guaranteed not to be overwritten by lower-priority writes.
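
The core of this mechanism can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Scrapy’s actual Settings implementation, and the priority numbers are made up:

# Simplified sketch of priority-aware settings storage

class PrioritySettings(object):
    def __init__(self):
        self._store = {}  # maps setting name -> (value, priority)

    def set(self, name, value, priority=0):
        # Overwrite only if the new priority equals or exceeds the stored one
        if name not in self._store or priority >= self._store[name][1]:
            self._store[name] = (value, priority)

    def get(self, name, default=None):
        return self._store.get(name, (default, None))[0]

settings = PrioritySettings()
settings.set('DOWNLOAD_DELAY', 0, priority=0)   # built-in default
settings.set('DOWNLOAD_DELAY', 2, priority=20)  # e.g. from settings.py
settings.set('DOWNLOAD_DELAY', 5, priority=10)  # lower priority: ignored
assert settings.get('DOWNLOAD_DELAY') == 2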

Compound Type Settings

Not all of Scrapy’s settings are simple (non-compound) values such as DOWNLOAD_DELAY or USER_AGENT. In particular, many of the settings related to enabling extensions are dictionaries. Typically, the dictionary keys state where an extension can be found, and the values determine the order in which extensions should be called. For example, if you have two filter pipelines and an export pipeline, you can make sure the filters are called before exporting by enabling your pipelines in the following way:

# In settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.filter_one': 0,
    'myproject.pipelines.filter_two': 10,
    'myproject.pipelines.mongodb_exporter': 20,
}

There is a problem with settings priorities and these compound type settings: there are no per-key priorities. In other words, the whole dictionary has only a single priority. Moreover, the complete dictionary is overwritten, rather than updated, every time the setting is written to. This has some unpleasant consequences. For example, if I want to temporarily disable the second filter (from above) via the command line, I cannot simply pass that settings update (encoded in JSON) via

scrapy crawl -s 'ITEM_PIPELINES={"myproject.pipelines.filter_two": null}' myspider

as the ITEM_PIPELINES setting would be completely overwritten, such that the only entry it now holds is a disabled filter_two.
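
In terms of plain dictionaries, what happens is roughly this:

# What effectively happens at the moment: the whole dict is replaced
ITEM_PIPELINES = {
    'myproject.pipelines.filter_one': 0,
    'myproject.pipelines.filter_two': 10,
    'myproject.pipelines.mongodb_exporter': 20,
}
ITEM_PIPELINES = {'myproject.pipelines.filter_two': None}
# filter_one and mongodb_exporter are now gone as well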

Introducing per-key priorities

As the add-ons will mostly update the dictionary-like settings, my first sub-project aims at allowing these kinds of settings to be updated from multiple locations such that

  • the dictionary is not completely overwritten, but
  • new keys are inserted and their per-key priority is saved, and
  • existing keys are updated if the update priority is high enough.

This is achieved by promoting the (default) dictionary settings from dict instances to Settings instances. This requires:

  • completing the dictionary-like interface of the Settings class, and
  • rerouting set() calls for these settings to their update() methods (sketched below).
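
Building on the sketch from above, the intended behaviour could look roughly like this. Again, this is a simplified illustration, not the actual pull request code:

# Simplified sketch: per-key priorities for dictionary-like settings,
# extending the PrioritySettings sketch from above

class DictSettings(PrioritySettings):
    def update(self, values, priority=0):
        # Merge key by key; each key keeps its own priority
        for name, value in values.items():
            self.set(name, value, priority)

pipelines = DictSettings()
pipelines.update({
    'myproject.pipelines.filter_one': 0,
    'myproject.pipelines.filter_two': 10,
    'myproject.pipelines.mongodb_exporter': 20,
}, priority=20)  # as set in settings.py

# A command line update now only touches the key it names
pipelines.update({'myproject.pipelines.filter_two': None}, priority=40)

assert pipelines.get('myproject.pipelines.filter_one') == 0     # untouched
assert pipelines.get('myproject.pipelines.filter_two') is None  # disabled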

It has further benefits for the structure of the Scrapy settings. Previously, some of the dictionary settings had defaults, e.g. there was a large number of downloader middlewares enabled out of the box. To prevent these defaults from being completely overwritten, rather than updated, when users set their own DOWNLOADER_MIDDLEWARES in settings.py, they were outsourced into a separate DOWNLOADER_MIDDLEWARES_BASE setting. When the downloader middleware list was then compiled, the two dictionaries were merged, with higher priority given to the user-defined DOWNLOADER_MIDDLEWARES.
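
Conceptually, that merge amounted to something like this (a simplification, not the actual Scrapy code):

# How the old _BASE merge worked, conceptually
middlewares = dict(DOWNLOADER_MIDDLEWARES_BASE)  # Scrapy's defaults
middlewares.update(DOWNLOADER_MIDDLEWARES)       # user entries win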

Implementing the per-key priorities for dictionary-like settings made this structure obsolete. The default downloader middlewares (and similar components) can now simply be saved in DOWNLOADER_MIDDLEWARES without fearing that they’re overwritten (unless specifically wanted) when the user sets their own.

The pull request for this sub-project will soon be finished. It includes code cleanup for the now-deprecated _BASE settings as well as fixes for some pre-existing inconsistencies (not all components could previously be disabled by setting their order to None), and it can be found on GitHub.