Hi! My name’s Jakob, and I was lucky enough to be selected for this year’s Google Summer of Code. Here, I will blog about my progress and about the journey of making my first larger contribution to the open-source community.
This first post is an introduction for the uninitiated: it presents the software I will be working on and the motivation behind my project. Future posts will probably go much deeper into technical detail and require a certain familiarity with Scrapy and its code base.
Ever caught yourself being annoyed that a website you want to read data from does not provide an API, or provides a crippled one? Meet Scrapy, an open-source Python framework “for extracting the data you need from websites. In a fast, simple, yet extensible way.”
Scrapy crawls web pages from a given start point (or from multiple points), following the links you want it to follow. Every retrieved page is parsed and given to your Spider (that is the part that you’ll have to write), where you select data and gather it into Items. The items are then further processed, filtered, and exported into files / feeds / databases / whatever you desire. Take a look at Scrapy at a Glance if you’re not familiar with it yet.
Its many hooks and extension mechanisms give Scrapy the flexibility to adapt to almost any task that involves automated crawling or data extraction. You can mess with the HTTP requests and responses, implement or change how different file types and transfer protocols are handled, replace core components of the framework, easily integrate it into larger applications, decide what data should be saved where and how in great detail, etc. etc.
This flexibility necessarily comes at the price of a certain complexity. The many hooks are controlled through many settings variables, and larger extensions often require several of them to be edited in a coordinated fashion. Currently, the user is burdened with changing all the settings needed to enable or disable extensions, meeting their dependencies, and supervising possible extension interplay. Correspondingly, extension developers have very little control over whether their library dependencies and configuration requirements are met.
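To make the coordination problem concrete, here is a rough sketch of what enabling a single extension can look like in a project's settings file today. The extension `mydb`, its component paths, and the `MYDB_*` names are made up for illustration; only `ITEM_PIPELINES`, `DOWNLOADER_MIDDLEWARES`, and `EXTENSIONS` are actual Scrapy setting names.

```python
# settings.py (sketch): one hypothetical extension, "mydb", spread
# over several settings that the user has to keep consistent by hand.

# Hypothetical component paths; the numbers control ordering.
ITEM_PIPELINES = {
    "mydb.pipelines.StoragePipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {
    "mydb.middlewares.DedupMiddleware": 543,
}
EXTENSIONS = {
    "mydb.extensions.StatsReporter": 0,
}

# Extension-specific configuration (hypothetical setting names).
MYDB_HOST = "localhost"
MYDB_PASSWORD = "secret"
```

Forget one of these entries, or pick an order value that clashes with another extension, and nothing warns you until the crawl misbehaves.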
My project: Simplified Scrapy Add-ons
My Summer of Code project aims at improving both user and developer experience by implementing a simplified interface for managing Scrapy extensions. When implemented, extension management will be closer to “plug and play”. Users will be able to enable and configure extensions in a simple and intuitive manner, while developers will gain easier control over configuration and dependency management.
An additional component, the add-on manager, will be added to the framework. This manager provides a single entry point where users can enable/disable and configure extensions. While they may supply more, users will only need to give the bare minimum extension configuration (not Scrapy configuration), such as database passwords. This configuration is handed over to the extension itself, where developers are supplied with mechanisms to impose their required (Scrapy) configuration settings and to check and resolve their dependencies. Just before the Scrapy engine begins crawling, all extensions will be allowed to perform post-initialisation tests to enforce proper configuration and avoid unwanted extension interplay.
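The flow described above could be sketched roughly as follows. This is plain Python with no Scrapy imports, and every name in it (`AddonManager`, `MyDbAddon`, `update_settings`, `check_configuration`, the `mydb` paths) is a placeholder of mine, not the final interface. Settings are modelled as a simple dict.

```python
class AddonError(Exception):
    """Raised when an add-on's dependencies or configuration are not met."""


class MyDbAddon:
    """Hypothetical add-on that stores scraped items in a database."""

    name = "mydb"
    requires = ()  # names of other add-ons this one depends on

    def update_settings(self, config, settings):
        # Impose the Scrapy-level settings this add-on needs ...
        pipelines = settings.setdefault("ITEM_PIPELINES", {})
        pipelines["mydb.pipelines.StoragePipeline"] = 300
        # ... and pass on the user's minimal configuration, with defaults.
        settings["MYDB_HOST"] = config.get("host", "localhost")
        settings["MYDB_PASSWORD"] = config.get("password")

    def check_configuration(self, settings):
        # Post-initialisation test, run just before crawling would start.
        if not settings.get("MYDB_PASSWORD"):
            raise AddonError("mydb: no database password configured")


class AddonManager:
    """Single entry point for enabling and configuring add-ons."""

    def __init__(self):
        self._addons = {}  # add-on name -> (add-on, user configuration)

    def add(self, addon, config=None):
        self._addons[addon.name] = (addon, dict(config or {}))

    def check_dependencies(self):
        for addon, _ in self._addons.values():
            missing = [dep for dep in addon.requires if dep not in self._addons]
            if missing:
                raise AddonError(f"{addon.name}: missing add-ons {missing}")

    def update_settings(self, settings):
        for addon, config in self._addons.values():
            addon.update_settings(config, settings)

    def check_configuration(self, settings):
        for addon, _ in self._addons.values():
            addon.check_configuration(settings)


# The user supplies only the bare minimum: here, a database password.
manager = AddonManager()
manager.add(MyDbAddon(), {"password": "secret"})

settings = {}
manager.check_dependencies()
manager.update_settings(settings)
manager.check_configuration(settings)
```

The point of the split is that the user only touches `manager.add(...)`, while everything about pipelines, ordering, and sanity checks lives with the add-on's author.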
A more detailed motivation, many technical details of the planned changes to Scrapy’s code base, and my proposed timeline can be found in my GSoC proposal.