Last time, we learned that most Scrapy extension hooks are controlled via dictionary-like settings variables. By extending Scrapy’s priority-based settings system to dictionaries, we made it possible to update these settings from different places without having to worry about order. The corresponding pull request, which includes complete tests and documentation, is now ready for final review. Now that this is (almost) out of the way, how can we “[improve] both user and developer experience by implementing a simplified interface to managing Scrapy extensions”, as I promised in my initial blog post?

The Concept of Add-ons

Often, extension developers provide their users with small manuals that show which settings they need to modify and how. The idea behind add-ons is to give developers mechanisms to apply these basic settings themselves. The user, on the other hand, no longer needs to understand Scrapy’s internal structure. Instead, she only needs to “plug in” the add-on at a single, unified entry point, possibly through a single line. If necessary, she can also configure the add-on at this entry point, e.g. to supply database credentials.

Let us assume that we have a simple pipeline that saves items into a MySQL database. Currently, the user has to configure her settings.py file along these lines:

# In settings.py

ITEM_PIPELINES = {
    # Possible further pipelines here
    'myproject.pipelines.mysql_pipe': 0,
}

MYSQL_DB = 'some.server'
MYSQL_USER = 'some_user'
MYSQL_PASSWORD = 'some!password'

This has several shortcomings:

  • the user is required to either edit settings blindly (Why ITEM_PIPELINES? What does the 0 mean?), or learn about Scrapy internals
  • all settings are exposed into the global settings namespace, creating potential for name clashes
  • the add-on developer has no way to check for dependencies and proper configuration

With the add-on system, the user experience would be closer to this:

# In scrapy.cfg

[addon:mysql_pipe]
database = some.server
user = some_user
password = some!password

Note that:

  • Scrapy’s internals (ITEM_PIPELINES, 0) are hidden
  • Specifying a complete Python path (myproject.pipelines.mysql_pipe) is no longer necessary
  • The database credentials are no longer independent settings, but local to the add-on section

Add-ons from a Developer’s Point of View

With the add-on system, developers gain greater control over Scrapy’s configuration. All they have to do is write any Python object that implements Scrapy’s add-on interface. The interface can be provided in a Python module, in a separate class, or alongside the extension class they wrote. It consists of two attributes and two callbacks:

  • NAME: String with human-readable add-on name
  • VERSION: tuple containing major/minor/patchlevel version of the add-on
  • update_settings()
  • check_configuration()

While the two attributes can be used for dependency management (e.g. “My add-on needs add-on X > 1.1.0”), the two callbacks are where developers gain control over Scrapy’s settings, freeing them from relying on their users to properly follow their configuration manuals. In update_settings(), the add-on receives its (local) configuration from scrapy.cfg and the Scrapy Settings object. It can then internally configure the extensions and expose settings into the global namespace as it sees fit. The second callback, check_configuration(), is called after Scrapy’s crawler is fully initialised and should be used for dependency checks and post-init tests.
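
To make this a bit more concrete, here is a minimal sketch of what the MySQL pipeline add-on from the example above could look like when written as a plain Python module. The exact callback signatures and the 'addon' settings priority used below are not final and may well change as the SEP evolves:

# A minimal, hypothetical sketch of the add-on interface for the
# MySQL pipeline example; signatures and priority names may change.

NAME = 'mysql_pipe'
VERSION = (0, 1, 0)

def update_settings(config, settings):
    # Enable the pipeline so the user no longer has to touch ITEM_PIPELINES
    settings.set('ITEM_PIPELINES',
                 {'myproject.pipelines.mysql_pipe': 0},
                 priority='addon')
    # Expose the section-local configuration from scrapy.cfg as settings
    # that the pipeline can read at runtime
    settings.set('MYSQL_DB', config.get('database'), priority='addon')
    settings.set('MYSQL_USER', config.get('user'), priority='addon')
    settings.set('MYSQL_PASSWORD', config.get('password'), priority='addon')

def check_configuration(crawler):
    # Called after the crawler is fully initialised; a good place for
    # dependency checks and post-init sanity tests
    if not crawler.settings.get('MYSQL_DB'):
        raise ValueError("mysql_pipe add-on: no database configured")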

Current State

So far, I have redrafted an existing Scrapy Extension Proposal (SEP) with an outline of the add-on implementation. Code-wise, I have already written loaders that read add-on configuration from Scrapy’s config files, then search for and initialise the add-on objects.
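
As a rough illustration only (this is not the actual code from the pull request, which among other things can find add-ons without a full Python path), reading the add-on configuration and importing the add-on objects could look something like this:

# Rough sketch of a config reader and loader for [addon:...] sections;
# for simplicity, it assumes every add-on is given as a dotted module path.

try:
    from configparser import ConfigParser          # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2

from importlib import import_module

def read_addon_config(cfg_path='scrapy.cfg'):
    """Return {add-on name: local config dict} for every [addon:...] section."""
    parser = ConfigParser()
    parser.read(cfg_path)
    configs = {}
    for section in parser.sections():
        if section.startswith('addon:'):
            name = section.split(':', 1)[1]
            configs[name] = dict(parser.items(section))
    return configs

def load_addon(path):
    """Import and return the module that provides the add-on interface."""
    return import_module(path)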

Where exactly the add-on objects should live is still up for debate. Currently, I plan on writing a small helper class that holds the add-on objects and provides helpers to access their attributes. This ‘holder’ would then live on the crawler, which is Scrapy’s central entry point object for all extensions and which manages the crawling process.
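
Such a holder might look roughly like this; the class and method names are only illustrative at this point:

# One possible shape for the 'holder' described above.

class AddonManager(object):
    """Holds loaded add-on objects and their section-local configuration."""

    def __init__(self):
        self.addons = []
        self.configs = {}

    def add(self, addon, config=None):
        self.addons.append(addon)
        self.configs[addon.NAME] = config or {}

    def update_settings(self, settings):
        # Let every add-on configure Scrapy before the crawler
        # components are instantiated
        for addon in self.addons:
            addon.update_settings(self.configs[addon.NAME], settings)

    def check_configuration(self, crawler):
        # Run every add-on's post-initialisation checks
        for addon in self.addons:
            addon.check_configuration(crawler)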

You can follow my progress in my Add-ons pull request.