Incremental Crawlers
Concept: monitor a website for data updates, so that only newly added data is crawled.
Core idea: deduplication!
There are two main scenarios:
Depth-based crawling
For sites crawled in depth, the urls of the detail pages must be recorded and checked:
- Record: save the url of every detail page that has been crawled
  - store the urls in a Redis set
- Check: before sending a request to a detail page, look its url up in the record; if it is already present, that url has been crawled before and can be skipped.
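The record-and-check step can be sketched without Scrapy or a live Redis server; in this hypothetical sketch a plain Python set stands in for the Redis set (redis's `sadd` behaves the same way, returning 1 for a newly added member and 0 for a duplicate):

```python
def seen_before(url, record):
    """Return True if url was already crawled; otherwise record it and return False."""
    if url in record:
        return True
    record.add(url)
    return False

record = set()
print(seen_before('https://example.com/detail/1', record))  # False: new url, crawl it
print(seen_before('https://example.com/detail/1', record))  # True: duplicate, skip it
```

Using Redis instead of an in-memory set matters in practice: the record survives restarts, so a re-run of the spider only fetches pages that appeared since the last run.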
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from zjs_moviePro.items import ZjsMovieproItem

class MovieSpider(CrawlSpider):
    name = 'movie'
    conn = Redis(host='127.0.0.1', port=6379)
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/6.html']
    # pagination urls look like /index.php/vod/show/id/6/page/2.html
    rules = (
        Rule(LinkExtractor(allow=r'id/6/page/\d+\.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            name = li.xpath('./div/div/h4/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/div/h4/a/@href').extract_first()
            ex = self.conn.sadd('movie_detail_urls', detail_url)
            if ex == 1:  # detail_url was newly inserted into the Redis set
                print('New data to crawl......')
                item = ZjsMovieproItem()
                item['name'] = name
                yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                     meta={'item': item})
            else:
                print('This data has already been crawled!')

    def parse_detail(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item
```
Non-depth crawling
Key term: data fingerprint
A data fingerprint is a unique identifier for one piece of data. When the crawled content itself changes (e.g. a page whose url stays the same but whose records update), urls cannot serve as the dedup key, so a fingerprint is computed from the data instead and recorded the same way.
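A minimal sketch of building a data fingerprint, assuming the crawled record is a dict (the field names here are hypothetical): serialize the fields in a stable order and hash the result. In production the fingerprint would be stored in a Redis set with `sadd`, exactly like the detail-page urls above; a plain set is used here so the sketch runs standalone:

```python
import hashlib
import json

def fingerprint(item):
    """Hash a record's fields (in a stable, sorted order) into a unique id."""
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

seen = set()
item = {'title': 'some headline', 'content': 'body text'}
fp = fingerprint(item)
if fp not in seen:
    seen.add(fp)  # new data: record the fingerprint and persist the item
```

`sort_keys=True` is what makes the fingerprint stable: two records with the same fields always produce the same hash regardless of the order the fields were extracted in.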