Core API¶
New in version 0.15.
This section documents the Scrapy core API, and it’s intended for developers of extensions and middlewares.
Crawler API¶
The main entry point to the Scrapy API is the Crawler
object, passed to extensions through the from_crawler class method. This
object provides access to all Scrapy core components, and it’s the only way for
extensions to access them and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed
extensions. It is configured through the EXTENSIONS setting, which
contains a dictionary of all available extensions and their order, similar to
how you configure the downloader middlewares.
-
class scrapy.crawler.Crawler(spidercls, settings)¶ The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.
-
settings¶ The settings manager of this crawler.
This is used by extensions & middlewares to access the Scrapy settings of this crawler.
For an introduction on Scrapy settings see Settings.
For the API see the Settings class.
-
signals¶ The signals manager of this crawler.
This is used by extensions & middlewares to hook themselves into Scrapy functionality.
For an introduction on signals see Signals.
For the API see the SignalManager class.
-
stats¶ The stats collector of this crawler.
This is used by extensions & middlewares to record stats of their behaviour, or to access stats collected by other extensions.
For an introduction on stats collection see Stats Collection.
For the API see the StatsCollector class.
-
extensions¶ The extension manager that keeps track of enabled extensions.
Most extensions won’t need to access this attribute.
For an introduction on extensions and a list of the extensions available in Scrapy see Extensions.
-
engine¶ The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.
Some extensions may want to access the Scrapy engine to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.
-
spider¶ Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created with the arguments given to the crawl() method.
-
crawl(*args, **kwargs)¶ Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
-
-
class scrapy.crawler.CrawlerRunner(settings=None)¶ This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up Twisted reactor.
The CrawlerRunner object must be instantiated with a Settings object.
This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless you are writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
-
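crawl(crawler_or_spidercls, *args, **kwargs)¶ Run a crawler with the provided arguments.
It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- crawler_or_spidercls – an already created Crawler, or a Spider subclass or spider name from which to create one
- args – positional arguments used to initialize the spider
- kwargs – keyword arguments used to initialize the spider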
-
create_crawler(crawler_or_spidercls)¶ Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
-
stop()¶ Simultaneously stops all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
-
-
class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)¶ Bases: scrapy.crawler.CrawlerRunner
A class to run multiple Scrapy crawlers in a process simultaneously.
This class extends CrawlerRunner by adding support for starting a Twisted reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.
The CrawlerProcess object must be instantiated with a Settings object.
Parameters: install_root_handler – whether to install root logging handler (default: True)
This class shouldn’t be needed (since Scrapy is responsible for using it accordingly) unless you are writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
-
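crawl(crawler_or_spidercls, *args, **kwargs)¶ Run a crawler with the provided arguments.
It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- crawler_or_spidercls – an already created Crawler, or a Spider subclass or spider name from which to create one
- args – positional arguments used to initialize the spider
- kwargs – keyword arguments used to initialize the spider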
-
create_crawler(crawler_or_spidercls)¶ Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.
-
start(stop_after_crawl=True)¶ This method starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.
If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().
Parameters: stop_after_crawl (boolean) – whether to stop the reactor when all crawlers have finished
-
stop()¶ Simultaneously stops all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
-
Settings API¶
-
scrapy.settings.SETTINGS_PRIORITIES¶ Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.
    SETTINGS_PRIORITIES = {
        'default': 0,
        'command': 10,
        'project': 20,
        'spider': 30,
        'cmdline': 40,
    }
For a detailed explanation of each settings source, see Settings.
-
scrapy.settings.get_settings_priority(priority)¶ Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.
-
class scrapy.settings.Settings(values=None, priority='project')¶ Bases: scrapy.settings.BaseSettings
This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.
It is a direct subclass of BaseSettings and supports all of its methods. Additionally, after instantiation of this class, the new object will have the global default settings described in Built-in settings reference already populated.
-
class scrapy.settings.BaseSettings(values=None, priority='project')¶ Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).
Key-value entries can be passed on initialization with the values argument, and they will take the priority level (unless values is already an instance of BaseSettings, in which case the existing priority levels will be kept). If the priority argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.
Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
-
copy()¶ Make a deep copy of current settings.
This method returns a new instance of the Settings class, populated with the same values and their priorities.
Modifications to the new object won’t be reflected on the original settings.
-
copy_to_dict()¶ Make a copy of current settings and convert to a dict.
This method returns a new dict populated with the same values and their priorities as the current settings.
Modifications to the returned dict won’t be reflected on the original settings.
This method can be useful for example for printing settings in Scrapy shell.
-
freeze()¶ Disable further changes to the current settings.
After calling this method, the present state of the settings will become immutable. Trying to change values through the set() method and its variants will no longer be possible and will raise an error.
-
frozencopy()¶ Return an immutable copy of the current settings.
-
get(name, default=None)¶ Get a setting value without affecting its original type.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getbool(name, default=False)¶ Get a setting value as a boolean.
1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.
For example, settings populated through environment variables set to '0' will return False when using this method.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getdict(name, default=None)¶ Get a setting value as a dictionary. If the setting original type is a dictionary, a copy of it will be returned. If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getfloat(name, default=0.0)¶ Get a setting value as a float.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getint(name, default=0)¶ Get a setting value as an int.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getlist(name, default=None)¶ Get a setting value as a list. If the setting original type is a list, a copy of it will be returned. If it’s a string it will be split by ",".
For example, settings populated through environment variables set to 'one,two' will return a list ['one', 'two'] when using this method.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
-
getpriority(name)¶ Return the current numerical priority value of a setting, or None if the given name does not exist.
Parameters: name (string) – the setting name
-
getwithbase(name)¶ Get a composition of a dictionary-like setting and its _BASE counterpart.
Parameters: name (string) – name of the dictionary-like setting
-
maxpriority()¶ Return the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.
-
set(name, value, priority='project')¶ Store a key/value attribute with a given priority.
Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won’t have any effect.
Parameters:
- name (string) – the setting name
- value (any) – the value to associate with the setting
- priority (string or int) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer
-
setmodule(module, priority='project')¶ Store settings from a module with a given priority.
This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.
Parameters:
- module (module object or string) – the module or the path of the module
- priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
-
update(values, priority='project')¶ Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.
Parameters:
- values (dict or string or BaseSettings) – the settings names and values
- priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
-
SpiderLoader API¶
-
class scrapy.spiderloader.SpiderLoader¶ This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.
-
from_settings(settings)¶ This class method is used by Scrapy to create an instance of the class. It’s called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.
Parameters: settings (Settings instance) – project settings
-
load(spider_name)¶ Get the Spider class with the given name. It’ll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.
Parameters: spider_name (str) – spider class name
-
list()¶ Get the names of the available spiders in the project.
-
Signals API¶
-
class scrapy.signalmanager.SignalManager(sender=_Anonymous)¶
-
connect(receiver, signal, **kwargs)¶ Connect a receiver function to a signal.
The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.
Parameters:
- receiver (callable) – the function to be connected
- signal (object) – the signal to connect to
-
disconnect(receiver, signal, **kwargs)¶ Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.
-
disconnect_all(signal, **kwargs)¶ Disconnect all receivers from the given signal.
Parameters: signal (object) – the signal to disconnect from
-
send_catch_log(signal, **kwargs)¶ Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
-
send_catch_log_deferred(signal, **kwargs)¶ Like send_catch_log() but supports returning deferreds from signal handlers.
Returns a Deferred that gets fired once all signal handlers’ deferreds have fired. Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
-
Stats Collector API¶
There are several Stats Collectors available under the
scrapy.statscollectors module and they all implement the Stats
Collector API defined by the StatsCollector
class (which they all inherit from).
-
class scrapy.statscollectors.StatsCollector¶
-
get_value(key, default=None)¶ Return the value for the given stats key or default if it doesn’t exist.
-
get_stats()¶ Get all stats from the currently running spider as a dict.
-
set_value(key, value)¶ Set the given value for the given stats key.
-
set_stats(stats)¶ Override the current stats with the dict passed in the stats argument.
-
inc_value(key, count=1, start=0)¶ Increment the value of the given stats key, by the given count, assuming the start value given (when it’s not set).
-
max_value(key, value)¶ Set the given value for the given key only if current value for the same key is lower than value. If there is no current value for the given key, the value is always set.
-
min_value(key, value)¶ Set the given value for the given key only if current value for the same key is greater than value. If there is no current value for the given key, the value is always set.
-
clear_stats()¶ Clear all stats.
The following methods are not part of the stats collection API but are instead used when implementing custom stats collectors:
-
open_spider(spider)¶ Open the given spider for stats collection.
-
close_spider(spider)¶ Close the given spider. After this is called, no more specific stats can be accessed or collected.
-