Previously, I introduced the concept of Scrapy add-ons and how it will improve the experience of both users and developers. Users will have a single entry point for enabling and configuring add-ons without being required to learn about Scrapy’s internal settings structure. Developers will gain better control over enforcing and checking proper configuration of their Scrapy extensions. In addition to their extension, they can provide a Scrapy add-on. An add-on is any Python object that provides the add-on interface. The interface, in turn, consists of a few descriptive variables (name, version, …) and two callbacks: one for enforcing configuration, called before the initialisation of Scrapy’s crawler, and one for post-init checks, called immediately before crawling begins. This post describes the current state of, and issues with, the implementation of add-on management in Scrapy.
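To make the interface more concrete, here is a minimal sketch of what such an add-on object might look like. Only update_settings() is named later in this post; the name check_configuration(), both argument lists, and the MYSQL_* setting names are assumptions made for illustration, and a plain dict stands in for Scrapy's settings object.

```python
class MySQLPipeAddon:
    """Hypothetical add-on; all names and signatures are illustrative only."""

    # Descriptive variables
    name = "mysql_pipe"
    version = "1.0"

    def update_settings(self, config, settings):
        # First callback: enforce configuration before the crawler is initialised.
        settings["MYSQL_DATABASE"] = config.get("database", "localhost")
        settings["MYSQL_USER"] = config["user"]
        settings["MYSQL_PASSWORD"] = config["password"]

    def check_configuration(self, config, final_settings):
        # Second callback (assumed name): post-init checks right before crawling.
        required = ("MYSQL_DATABASE", "MYSQL_USER", "MYSQL_PASSWORD")
        missing = [key for key in required if key not in final_settings]
        if missing:
            raise ValueError("missing settings: " + ", ".join(missing))


# A plain dict stands in for Scrapy's settings object in this sketch.
settings = {}
config = {"database": "some.server", "user": "some_user", "password": "some!password"}

addon = MySQLPipeAddon()
addon.update_settings(config, settings)
addon.check_configuration(config, settings)
```

The user only supplies the three configuration values; which internal settings they map to is the add-on's business.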
The pull request with the current work-in-progress on the implementation can be found on GitHub. Besides a lot of infrastructure (base classes, interfaces, helper functions, tests), its heart is the AddonManager. The add-on manager ‘holds’ all loaded add-ons and has methods to load configuration files, add add-ons, and check dependency issues. Furthermore, it is the entry point for calling the add-ons’ callbacks. The ‘loading’ and ‘holding’ parts can be used independently of one another, but in my eyes there are too many cross-dependencies for the ‘normal’ intended usage to justify separating them into two classes.
Two “single” entry points?
From a user’s perspective, Scrapy settings are controlled from two configuration files: scrapy.cfg and settings.py. This distinction is not some historical, backwards-compatible leftover, but has a sensible reason: Scrapy uses projects as its organisational structure. All spiders, extensions, declarations of what can be scraped, etc. live in a Scrapy project. Every project has a settings.py in which crawling-related settings are stored. However, there are other settings that cannot or should not live in settings.py. This (obviously) includes the path to settings.py (for ease of understanding, I will always write settings.py for the settings module, although it can be any Python module), and settings that are not bound to a particular project. Most prominently, Scrapyd, an application for deploying and running Scrapy spiders, uses scrapy.cfg to store information on deployment targets (i.e. the address and auth info for the server you want to deploy your Scrapy spiders to).
Now, add-ons are bound to a project as much as crawling settings are. Consequently, add-on configuration should live in settings.py. However, Python is a programming language, not a standard for configuration files, and its syntax is therefore (for the purpose of configuration) less user-friendly. An ini configuration like this:
    # In scrapy.cfg
    [addon:path.to.mysql_pipe]
    database = some.server
    user = some_user
    password = some!password
would (could) look similar to this in Python syntax:
    # In settings.py
    addon_mysqlpipe = dict(
        _name = 'path.to.mysql_pipe',
        database = 'some.server',
        user = 'some_user',
        password = 'some!password',
    )
While I much prefer the first version, putting add-on configuration into scrapy.cfg would be very inconsistent with the previous distinction of the two configuration files. It will therefore probably end up in settings.py. The syntax is a little less user-friendly, but after all, most Scrapy users should be familiar with Python. For now, I have decided to write code that reads from both.
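As a rough illustration of what reading from both sources could involve, the sketch below collects add-on configurations from an ini-style text and from a settings module and shows that the two examples above yield the same result. The function names and the prefix constants are hypothetical and are not taken from the pull request; the settings module is built in memory so that the example is self-contained.

```python
import configparser
import types

ADDON_SECTION_PREFIX = "addon:"   # ini sections look like [addon:<path>]
ADDON_VARIABLE_PREFIX = "addon_"  # module variables look like addon_<anything>


def addons_from_cfg(cfg_text):
    """Collect add-on configs from the [addon:<path>] sections of an ini text."""
    # interpolation=None keeps '%' characters in values (e.g. passwords) literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        section[len(ADDON_SECTION_PREFIX):]: dict(parser[section])
        for section in parser.sections()
        if section.startswith(ADDON_SECTION_PREFIX)
    }


def addons_from_module(module):
    """Collect add-on configs from module-level dicts named addon_*."""
    addons = {}
    for var_name, value in vars(module).items():
        if var_name.startswith(ADDON_VARIABLE_PREFIX) and isinstance(value, dict):
            config = dict(value)        # copy, so the module itself stays untouched
            path = config.pop("_name")  # '_name' carries the add-on path
            addons[path] = config
    return addons


cfg_text = """
[addon:path.to.mysql_pipe]
database = some.server
user = some_user
password = some!password
"""

# Stand-in for an imported settings.py
settings_module = types.ModuleType("settings")
settings_module.addon_mysqlpipe = dict(
    _name="path.to.mysql_pipe",
    database="some.server",
    user="some_user",
    password="some!password",
)

from_cfg = addons_from_cfg(cfg_text)
from_py = addons_from_module(settings_module)
```

One difference worth keeping in mind: values read from the ini file are always strings, while the Python version can carry any type, so an add-on that accepts both has to convert values itself.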
Allowing add-ons to load and configure other add-ons
In some cases, it might be helpful if add-ons were allowed to load and configure other add-ons. For example, there might be ‘umbrella add-ons’ that decide what subordinate add-ons need to be enabled and configured given some configuration values. Or an add-on might depend on some other add-on being configured in a specific way. The big issue with this is that, with the current implementation, the first time the methods of an add-on are called is during the first round of callbacks to update_settings(). Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the update_settings() method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their update_settings() methods.
I see three possible approaches to this:
- Forbid add-ons from loading or configuring other add-ons. In this case, ‘umbrella add-ons’ would not be possible and all cross-configuration dependencies would again be left to the user.
- Forbid add-ons from doing any kind of settings introspection in update_settings(), and instead only allow them to change the settings object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their update_settings() method has not already been called.
- Add a third callback, update_addons(config, addonmgr), to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on the config (of their own add-on, i.e. the one whose update_addons() is currently called) is fragile as, once again, there is no guarantee in which order add-ons will be called back.
I have not put too much thought into it just yet, but I think I prefer option 3.
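To illustrate why a separate callback helps, here is a toy sketch of option 3. It is not the AddonManager from the pull request: the class below, its method names, the two example add-ons, and the dict used as settings object are all assumptions for illustration. The point is only the two-phase structure: every update_addons() call finishes before the first update_settings() call is made, so all add-ons see the same set of add-ons in the second phase.

```python
class AddonManager:
    """Toy stand-in for the real manager; only the two-phase idea is shown."""

    def __init__(self):
        self.addons = {}   # name -> add-on object
        self.configs = {}  # name -> configuration dict

    def add(self, addon, config=None):
        self.addons[addon.name] = addon
        self.configs.setdefault(addon.name, {}).update(config or {})

    def update_addons(self):
        # Phase 1: add-ons may load or configure other add-ons.
        # Add-ons that get added during this phase are called back as well.
        called = set()
        pending = list(self.addons)
        while pending:
            name = pending.pop(0)
            called.add(name)
            addon = self.addons[name]
            if hasattr(addon, "update_addons"):
                addon.update_addons(self.configs[name], self)
            pending.extend(
                n for n in self.addons if n not in called and n not in pending
            )

    def update_settings(self, settings):
        # Phase 2: the set of add-ons is fixed by now.
        for name, addon in self.addons.items():
            addon.update_settings(self.configs[name], settings)


class CacheAddon:
    name = "cache"

    def update_settings(self, config, settings):
        settings["CACHE_ENABLED"] = config.get("enabled", False)


class UmbrellaAddon:
    name = "umbrella"

    def update_addons(self, config, addonmgr):
        # Decide which subordinate add-ons to enable, based on own config.
        if config.get("use_cache"):
            addonmgr.add(CacheAddon(), {"enabled": True})

    def update_settings(self, config, settings):
        settings["UMBRELLA_ENABLED"] = True


manager = AddonManager()
manager.add(UmbrellaAddon(), {"use_cache": True})
manager.update_addons()

settings = {}  # a plain dict stands in for Scrapy's settings object
manager.update_settings(settings)
```

Note that the fragility mentioned in option 3 remains: if a second add-on reconfigured the umbrella add-on in its own update_addons(), the outcome would depend on which of the two is called first.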