Previously, I introduced the concept of Scrapy add-ons and how they will improve the experience of both users and developers. Users will have a single entry point for enabling and configuring add-ons without having to learn about Scrapy’s internal settings structure. Developers will gain better control over enforcing and checking the proper configuration of their Scrapy extensions: in addition to their extension, they can provide a Scrapy add-on. An add-on is any Python object that provides the add-on interface. The interface, in turn, consists of a few descriptive variables (name, version, …) and two callbacks: one for enforcing configuration, called before the initialisation of Scrapy’s crawler, and one for post-init checks, called immediately before crawling begins. This post describes the current state of, and the issues with, the implementation of add-on management in Scrapy.
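To make the interface a little more concrete, here is a minimal sketch of what such an add-on object might look like. Only update_settings() is a callback name that appears later in this post; check_configuration(), both signatures, the MYSQL_* setting names and the pipeline path are assumptions made up for illustration, and the settings object is replaced by a plain dict so the sketch runs on its own.

```python
class MySQLPipeAddon:
    """Hypothetical add-on; every name in this class is illustrative."""

    # Descriptive variables of the add-on interface
    name = "mysql_pipe"
    version = "1.0"

    def update_settings(self, config, settings):
        """Enforce configuration; meant to run before the crawler is initialised."""
        # `settings` is a plain dict here, standing in for Scrapy's settings object.
        settings["MYSQL_DATABASE"] = config.get("database", "localhost")
        settings["MYSQL_USER"] = config["user"]
        pipelines = settings.setdefault("ITEM_PIPELINES", {})
        pipelines["path.to.mysql_pipe.MySQLPipeline"] = 300

    def check_configuration(self, config, final_settings):
        """Post-init check; meant to run immediately before crawling begins."""
        pipelines = final_settings.get("ITEM_PIPELINES", {})
        if "path.to.mysql_pipe.MySQLPipeline" not in pipelines:
            raise ValueError("the MySQL pipeline was removed after update_settings()")


settings = {}
config = {"database": "some.server", "user": "some_user"}
addon = MySQLPipeAddon()
addon.update_settings(config, settings)      # first callback: enforce
addon.check_configuration(config, settings)  # second callback: verify
```

The point of the second callback is that something else may have changed the settings between the two calls; the add-on gets a chance to complain before the crawl starts rather than failing silently.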

Current state

The pull request with the current work-in-progress on the implementation can be found on GitHub. Besides a lot of infrastructure (base classes, interfaces, helper functions, tests), its heart is the AddonManager. The add-on manager ‘holds’ all loaded add-ons and has methods to load configuration files, add add-ons, and check for dependency issues. Furthermore, it is the entry point for calling the add-ons’ callbacks. The ‘loading’ and ‘holding’ parts can be used independently of one another, but in my eyes there are too many cross-dependencies in the ‘normal’ intended usage to justify separating them into two classes.
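As a rough illustration of the responsibilities listed above, here is a toy manager. It is not the AddonManager from the pull request: the class name, its method names, and the `requires` attribute used to declare dependencies are all assumptions for the sake of the sketch, and loading from configuration files is left out.

```python
class MiniAddonManager:
    """Toy stand-in for the add-on manager; not the real AddonManager API."""

    def __init__(self):
        self._addons = {}   # add-on name -> add-on object
        self._configs = {}  # add-on name -> configuration dict

    def add(self, addon, config=None):
        """'Hold' an add-on together with its configuration."""
        if addon.name in self._addons:
            raise ValueError(f"add-on {addon.name!r} is already loaded")
        self._addons[addon.name] = addon
        self._configs[addon.name] = dict(config or {})

    def check_dependencies(self):
        """Return (add-on name, missing requirement) pairs."""
        missing = []
        for name, addon in self._addons.items():
            for required in getattr(addon, "requires", ()):
                if required not in self._addons:
                    missing.append((name, required))
        return missing

    def update_settings(self, settings):
        """Entry point for the first callback of every held add-on."""
        for name, addon in self._addons.items():
            addon.update_settings(self._configs[name], settings)

    def check_configuration(self, crawler):
        """Entry point for the second callback of every held add-on."""
        for name, addon in self._addons.items():
            addon.check_configuration(self._configs[name], crawler)


class HttpCacheAddon:
    """Hypothetical add-on that declares a dependency on a 'storage' add-on."""

    name = "httpcache"
    version = "0.1"
    requires = ("storage",)  # hypothetical way of declaring dependencies

    def update_settings(self, config, settings):
        settings["HTTPCACHE_ENABLED"] = True

    def check_configuration(self, config, crawler):
        pass


manager = MiniAddonManager()
manager.add(HttpCacheAddon(), {"dir": "cache"})
missing = manager.check_dependencies()  # 'storage' was never added
settings = {}
manager.update_settings(settings)
```

Even in this toy version the holding part (the two dicts) and the calling part share so much state that splitting them would mostly add plumbing.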

Two “single” entry points?

From a user’s perspective, Scrapy settings are controlled from two configuration files: scrapy.cfg and settings.py. This distinction is not some historical leftover kept for backwards compatibility, but has a sensible reason: Scrapy uses projects as its organisational structure. All spiders, extensions, declarations of what can be scraped, etc. live in a Scrapy project. Every project has a settings.py in which crawling-related settings are stored. However, there are other settings that cannot or should not live in settings.py. This (obviously) includes the path to settings.py (for ease of understanding, I will always write settings.py for the settings module, although it can be any Python module), and settings that are not bound to a particular project. Most prominently, Scrapyd, an application for deploying and running Scrapy spiders, uses scrapy.cfg to store information on deployment targets (i.e. the address and auth info for the server you want to deploy your Scrapy spiders to).

Now, add-ons are bound to a project as much as crawling settings are. Consequently, add-on configuration should live in settings.py. However, Python is a programming language, not a standard for configuration files, and its syntax is therefore (for the purpose of configuration) less user-friendly. An ini configuration like this:

# In scrapy.cfg

[addon:path.to.mysql_pipe]
database = some.server
user = some_user
password = some!password

would (could) look similar to this in Python syntax:

# In settings.py

addon_mysqlpipe = dict(
    _name='path.to.mysql_pipe',
    database='some.server',
    user='some_user',
    password='some!password',
)

While I much prefer the first version, putting add-on configuration into scrapy.cfg would be very inconsistent with the previous distinction of the two configuration files. It will therefore probably end up in settings.py. The syntax is a little less user-friendly, but after all, most Scrapy users should be familiar with Python. For now, I have decided to write code that reads from both.
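Reading from both places could be sketched roughly as follows, using only the standard library. The `addon:` section prefix and the `addon_` variable prefix with its `_name` key are taken from the two examples above; the function names are my own, and the real implementation in the pull request may well look different.

```python
import configparser
import types

ADDON_SECTION_PREFIX = "addon:"   # section prefix from the scrapy.cfg example
ADDON_VARIABLE_PREFIX = "addon_"  # variable prefix from the settings.py example


def addons_from_cfg(cfg_text):
    """Collect {add-on path: config} from ini-style text."""
    # Interpolation is disabled so that '%' in values (e.g. passwords) is kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    addons = {}
    for section in parser.sections():
        if section.startswith(ADDON_SECTION_PREFIX):
            path = section[len(ADDON_SECTION_PREFIX):]
            addons[path] = dict(parser[section])
    return addons


def addons_from_module(module):
    """Collect {add-on path: config} from 'addon_*' dicts in a settings module."""
    addons = {}
    for attr, value in vars(module).items():
        if attr.startswith(ADDON_VARIABLE_PREFIX) and isinstance(value, dict):
            config = dict(value)      # copy, so the module's dict is left untouched
            path = config.pop("_name")
            addons[path] = config
    return addons


cfg_text = """
[addon:path.to.mysql_pipe]
database = some.server
user = some_user
password = some!password
"""

# Stand-in for an imported settings.py
settings_module = types.ModuleType("settings")
settings_module.addon_mysqlpipe = dict(
    _name="path.to.mysql_pipe",
    database="some.server",
    user="some_user",
    password="some!password",
)

from_cfg = addons_from_cfg(cfg_text)
from_py = addons_from_module(settings_module)
```

One difference worth keeping in mind: values read from the ini file are always strings, while settings.py can carry any Python object, so the two sources only agree exactly when all values are strings, as they are here.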

Allowing add-ons to load and configure other add-ons

In some cases, it might be helpful if add-ons were allowed to load and configure other add-ons. For example, there might be ‘umbrella add-ons’ that decide which subordinate add-ons need to be enabled and configured given some configuration values. Or an add-on might depend on some other add-on being configured in a specific way. The big issue with this is that, with the current implementation, the first time the methods of an add-on are called is during the first round of callbacks to update_settings(). Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the update_settings() method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their update_settings().
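The problem can be reproduced with a toy model. Nothing below is Scrapy code: the Registry class, the Observer and Umbrella add-ons, and the single-argument update_settings() signature are all made up for this demonstration. The registry does make sure that add-ons added during the round get called, yet the add-ons still end up with different views.

```python
class Registry:
    """Toy registry of enabled add-ons; stands in for the add-on manager."""

    def __init__(self):
        self.addons = []

    def add(self, addon):
        self.addons.append(addon)

    def run_update_settings(self):
        called = set()
        # Keep looping so that add-ons added during a round are called as well.
        while True:
            pending = [a for a in self.addons if a.name not in called]
            if not pending:
                break
            for addon in pending:
                called.add(addon.name)
                addon.update_settings(self)


class Observer:
    """Add-on that records which add-ons are enabled when it is called."""

    def __init__(self, name):
        self.name = name
        self.seen = None

    def update_settings(self, registry):
        self.seen = sorted(a.name for a in registry.addons)


class Umbrella:
    """Add-on that loads another add-on from within update_settings()."""

    name = "umbrella"

    def update_settings(self, registry):
        registry.add(Observer("late"))


early = Observer("early")
registry = Registry()
registry.add(early)
registry.add(Umbrella())
registry.run_update_settings()
late = registry.addons[-1]

print(early.seen)  # ['early', 'umbrella']
print(late.seen)   # ['early', 'late', 'umbrella']
```

The add-on called first never learns about the add-on that the umbrella loads afterwards, which is exactly the inconsistency described above.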

I see three possible approaches to this:

  1. Forbid add-ons from loading or configuring other add-ons. In this case ‘umbrella add-ons’ would not be possible and all cross-configuration dependencies would again be burdened onto the user.
  2. Forbid add-ons from doing any kind of settings introspection in update_settings(), and instead only allow them to make changes to the settings object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their update_settings() method has not already been called.
  3. Add a third callback, update_addons(config, addonmgr), to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on the config (of their own add-on, i.e. the one whose update_addons() is currently called) is fragile as, once again, there is no guarantee in which order add-ons will be called back.

I have not put too much thought into it yet, but I think I prefer option 3.
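A minimal sketch of how option 3 might work, under some assumptions: update_addons(config, addonmgr) uses the signature proposed in the list above, while the TwoPhaseManager class, the `ADDONS` settings key, the `use_cache` option and the (config, settings) signature of update_settings() are invented for this example. The idea is to finish all loading in a first phase, so that every add-on sees the same set of add-ons in the second.

```python
class TwoPhaseManager:
    """Toy manager: run all update_addons() first, then all update_settings()."""

    def __init__(self):
        self.addons = {}  # add-on name -> (add-on object, config dict)

    def add(self, addon, config=None):
        self.addons[addon.name] = (addon, dict(config or {}))

    def run(self, settings):
        # Phase 1: let add-ons load other add-ons until no new ones appear.
        done = set()
        while True:
            pending = [name for name in self.addons if name not in done]
            if not pending:
                break
            for name in pending:
                done.add(name)
                addon, config = self.addons[name]
                hook = getattr(addon, "update_addons", None)  # callback is optional
                if hook is not None:
                    hook(config, self)
        # Phase 2: the set of add-ons is now fixed.
        settings["ADDONS"] = sorted(self.addons)  # hypothetical settings key
        for addon, config in list(self.addons.values()):
            addon.update_settings(config, settings)


class Observer:
    """Add-on that records which add-ons it can see in update_settings()."""

    def __init__(self, name):
        self.name = name
        self.seen = None

    def update_settings(self, config, settings):
        self.seen = list(settings["ADDONS"])


class Umbrella:
    """Add-on that loads a subordinate add-on depending on its own config."""

    name = "umbrella"

    def update_addons(self, config, addonmgr):
        if config.get("use_cache"):
            addonmgr.add(Observer("cache"))

    def update_settings(self, config, settings):
        pass


manager = TwoPhaseManager()
early = Observer("early")
manager.add(early)
manager.add(Umbrella(), {"use_cache": True})
settings = {}
manager.run(settings)
cache = manager.addons["cache"][0]
```

Here both observers end up with the same view, ['cache', 'early', 'umbrella']. The ordering problem does not disappear entirely, though: it moves into the first phase, which is why relying on configuration inside update_addons() remains fragile.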