In my earlier posts, I have talked mostly about the motivation behind Scrapy's add-on system and its internal implementation. Here, I want to show how the add-on framework looks in action, i.e. how it actually affects the users' and developers' experience. We will see how users are able to configure built-in and third-party components without worrying about Scrapy's internal structure, and how developers can check and enforce requirements for their extensions. This blog entry will therefore probably feel a little like a documentation page, and indeed I hope that I can reuse some of it for the official Scrapy docs.

From a user’s perspective

To enable an add-on, all you need to do is provide its path and, if necessary, its configuration to Scrapy. There are two ways to do this:

  • via the ADDONS setting, and
  • via the scrapy.cfg file.

As Scrapy settings can be modified from many places, e.g. in a project’s settings.py, in a Spider’s custom_settings attribute, or from the command line, using the ADDONS setting is the preferred way to manage add-ons.

The ADDONS setting is a dictionary in which every key is the path to an add-on. The corresponding value is a (possibly empty) dictionary containing the add-on configuration. Specifying the full Python path is more precise, but it is not necessary if the add-on is either built into Scrapy or lives in your project's addons submodule.

This is an example where an internal add-on and a third-party add-on (in this case one requiring no configuration) are enabled/configured in a project’s settings.py:

ADDONS = {
    'httpcache': {
        'expiration_secs': 60,
        'ignore_http_codes': [404, 405],
    },
    'path.to.some.addon': {},
}
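Because ADDONS is an ordinary Scrapy setting, the same configuration could also live in a spider's custom_settings attribute, e.g. to enable an add-on for a single spider only. A short sketch, reusing the httpcache configuration from above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    # Enable the built-in httpcache add-on for this spider only
    custom_settings = {
        'ADDONS': {
            'httpcache': {
                'expiration_secs': 60,
                'ignore_http_codes': [404, 405],
            },
        },
    }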

It is also possible to manage add-ons from scrapy.cfg. While the syntax is a little friendlier, be aware that this file, and therefore the configuration in it, is not bound to a particular Scrapy project. This should not pose a problem as long as you use the project on your development machine only, but a common stumbling block is that scrapy.cfg is not deployed via scrapyd-deploy.

In scrapy.cfg, section names prefixed with addon: replace the dictionary keys. The configuration from above would look like this:

[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 404,405

[addon:path.to.some.addon]

From a developer’s perspective

Add-ons are Python objects that provide Scrapy's add-on interface. The interface is enforced through zope.interface, which leaves the choice of the Python object type up to the developer. For example:

  • for a small pipeline, the add-on interface could be implemented in the same class that also implements the open/close_spider and process_item callbacks
  • for larger add-ons, or for clearer structure, the interface could be provided by a stand-alone module

The absolute minimum interface consists of two attributes:

  • name: string with add-on name
  • version: version string (PEP 440, e.g. '1.0.1')
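Putting these together, a minimal (if not very useful) add-on could look like this. Note that the import path of the add-on interface below is an assumption:

from zope.interface import implementer

# Hypothetical import path for the add-on interface
from scrapy.interfaces import IAddon

@implementer(IAddon)
class MinimalAddon(object):
    name = 'minimaladdon'
    version = '1.0'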

Of course, stating just these two attributes will not get you very far. Add-ons can provide three callback methods that are called at various stages before the crawling process starts:

update_settings(config, settings)

This method is called during the initialization of the crawler. Here, you should perform dependency checks (e.g. for external Python libraries) and update the settings object as needed, e.g. enable components used by this add-on or set required configuration values of other extensions.

check_configuration(config, crawler)

This method is called when the crawler has been fully initialized, immediately before it starts crawling. You can perform additional dependency and configuration checks here.

update_addons(config, addons)

This method is called immediately before update_settings(), and should be used to enable and configure other add-ons only.

When using this callback, be aware that there is no guarantee in which order the update_addons() callbacks of enabled add-ons will be called. Add-ons that are added to the add-on manager during this callback will also have their update_addons() method called.
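For illustration, here is a sketch of an add-on that makes sure the built-in httpcache add-on is enabled before update_settings() runs. Note that the add() method and the containment test are assumptions about the add-on manager's API:

class MyAddon(object):
    name = 'myaddon'
    version = '1.0'

    def update_addons(self, config, addons):
        # Enable the built-in HTTP cache add-on unless it is already
        # enabled (add() and the 'in' test are hypothetical manager APIs)
        if 'httpcache' not in addons:
            addons.add('httpcache', {'expiration_secs': 60})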

Additionally, add-ons may (and should, where appropriate) provide one or more attributes that can be used for limited automated detection of possible dependency clashes:

  • requires: list of built-in or custom components needed by this add-on, as strings

  • modifies: list of built-in or custom components whose functionality is affected or replaced by this add-on (a custom HTTP cache should list httpcache here)

  • provides: list of components provided by this add-on (e.g. mongodb for an extension that provides generic read/write access to a MongoDB database)
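For example (all component names here are illustrative), an add-on providing generic MongoDB access, and a custom HTTP cache backend that builds on it, might declare:

class MongoDBAddon(object):
    name = 'mongodb_addon'
    version = '1.0'

    # Offers generic read/write access to a MongoDB database
    provides = ['mongodb']

class MyCacheAddon(object):
    name = 'mycache'
    version = '1.0'

    # Needs the MongoDB component provided above
    requires = ['mongodb']
    # Replaces the functionality of the built-in HTTP cache
    modifies = ['httpcache']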

Some example add-ons

The main advantage of add-ons is that developers gain better control over how and under which conditions their Scrapy extensions are deployed. For example, it is now easy to check for external libraries and have the crawler shut down gracefully if they are not available:

class MyAddon(object):
    name = 'myaddon'
    version = '1.0'

    def update_settings(self, config, settings):
        try:
            import boto
        except ImportError:
            raise RuntimeError("boto library is required")
        else:
            # Perform configuration, e.g. enable components that
            # depend on boto
            pass

Or, to avoid unwanted interplay with other extensions, add-ons, or user-set options, it is now also easy to check for misconfiguration in the final (final!) settings used to crawl:

class MyAddon(object):
    name = 'myaddon'
    version = '1.0'

    def update_settings(self, config, settings):
        settings.set('DNSCACHE_ENABLED', False, priority='addon')

    def check_configuration(self, config, crawler):
        if crawler.settings.getbool('DNSCACHE_ENABLED'):
            # The spider, some other add-on, or the user messed with the
            # DNS cache setting
            raise ValueError("myaddon is incompatible with DNS cache")

Instead of depending on the user to activate components and then gathering their configuration from the global settings namespace on initialization, it becomes feasible to instantiate components ad hoc:

from path.to.my.pipelines import MySQLPipeline

class MyAddon(object):
    name = 'myaddon'
    version = '1.0'

    def update_settings(self, config, settings):
        # Instantiate the pipeline with its add-on configuration and
        # register the ready-to-use instance in ITEM_PIPELINES
        mysqlpl = MySQLPipeline(password=config['password'])
        settings.set(
            'ITEM_PIPELINES',
            {mysqlpl: 200},
            priority='addon',
        )

Often, it will not be necessary to write an additional class just to provide an add-on for your extension. Instead, you can simply provide the add-on interface alongside the component interface, e.g.:

class MyPipeline(object):
    name = 'mypipeline'
    version = '1.0'

    def process_item(self, item, spider):
        # Do some processing here
        return item
        
    def update_settings(self, config, settings):
        settings.set(
            'ITEM_PIPELINES',
            {self: 200},
            priority='addon',
        )