通过核心ＡＰＩ启动单个或多个scrapy爬虫

wumxiaozhu

2020-01-17

1. 可以使用API从脚本运行Scrapy，而不是运行Scrapy的典型方法scrapy crawl；Scrapy是基于Twisted异步网络库构建的，因此需要在Twisted容器内运行它，可以通过两个API来运行单个或多个爬虫scrapy.crawler.CrawlerProcess、scrapy.crawler.CrawlerRunner。

2. 启动爬虫的的第一个实用程序是scrapy.crawler.CrawlerProcess 。该类将为您启动Twisted reactor，配置日志记录并设置关闭处理程序，此类是所有Scrapy命令使用的类。

示例运行单个爬虫：

交流群：1029344413 源码、素材学习资料import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):

    # Your spider definition

    ...


process = CrawlerProcess({

    ‘USER_AGENT‘: ‘Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)‘

})


process.crawl(MySpider)

process.start() # the script will block here until the crawling is finished

通过CrawlerProcess传入参数，并使用get_project_settings获取Settings 项目设置的实例。

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


process = CrawlerProcess(get_project_settings())

# ‘followall‘ is the name of one of the spiders of the project.

process.crawl(‘followall‘, domain=‘scrapinghub.com‘)

process.start() # the script will block here until the crawling is finished

还有另一个Scrapy实例方式可以更好地控制爬虫运行过程：scrapy.crawler.CrawlerRunner。此类封装了一些简单的帮助程序来运行多个爬虫程序，但它不会以任何方式启动或干扰现有的爬虫。
使用此类，显式运行reactor。如果已有爬虫在运行想在同一个进程中开启另一个Scrapy，建议您使用CrawlerRunner 而不是CrawlerProcess。
注意，爬虫结束后需要手动关闭Twisted reactor，通过向CrawlerRunner.crawl方法返回的延迟添加回调来实现。
下面是它的用法示例，在MySpider完成运行后手动停止容器的回调。

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):

    # Your spider definition

    ...


configure_logging({‘LOG_FORMAT‘: ‘%(levelname)s: %(message)s‘})

runner = CrawlerRunner()


d = runner.crawl(MySpider)

d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until the crawling is finished

在同一个进程中运行多个蜘蛛

默认情况下，Scrapy在您运行时为每个进程运行一个蜘蛛。但是，Scrapy支持使用内部API为每个进程运行多个蜘蛛。

这是一个同时运行多个蜘蛛的示例：

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...


process = CrawlerProcess()

process.crawl(MySpider1)

process.crawl(MySpider2)

process.start() # the script will block here until all crawling jobs are finished

使用CrawlerRunner示例：

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...


configure_logging()

runner = CrawlerRunner()

runner.crawl(MySpider1)

runner.crawl(MySpider2)

d = runner.join()

d.addBoth(lambda _: reactor.stop())


reactor.run() # the script will block here until all crawling jobs are finished

相同的示例，但通过异步运行爬虫蛛：

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):

    # Your first spider definition

    ...

class MySpider2(scrapy.Spider):

    # Your second spider definition

    ...


configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():

    yield runner.crawl(MySpider1)

    yield runner.crawl(MySpider2)

    reactor.stop()


crawl()

reactor.run() # the script will block here until the last crawl call is finished

scrapy

安科网

通过核心ＡＰＩ启动单个或多个scrapy爬虫

wumxiaozhu

在同一个进程中运行多个蜘蛛

wumxiaozhu

相关推荐

如何利用Scrapy爬虫框架抓取网页全部文章信息（上篇）

一分钟搞定Scrapy分布式爬虫、队列和布隆过滤器

一篇文章教会你理解Scrapy网络爬虫框架的工作原理和数据采集过程

在Scrapy中如何利用Xpath选择器从网页中采集目标数据——详细教程（下篇）

在Scrapy中如何利用Xpath选择器从网页中采集目标数据——详细教程（上篇）

手把手教你进行Scrapy中item类的实例化操作

如何改造 Scrapy 从而实现多网站大规模爬取？

二十六、Scrapy自定义命令

scrapy 管理部署的爬虫项目的python类

分布式爬虫部署基于scrapy和scrapy-redis

8_3 scrapy模拟登录人人网

Python爬虫 - scrapy

Scrapy爬虫

用scrapy爬取图片

scrapy基本知识

Python爬虫 - scrapy框架的基本操作

十八、scrapy内置媒体（图片和文件）下载方式

Scrapy爬虫

Python Scrapy图片爬取原理及代码实例

scrapy 详解

wumxiaozhu