A Python example: scraping a free novel from Qidian (起点中文网) with the Scrapy framework
Tools used: Ubuntu, Python, PyCharm
I. Create the project in PyCharm: steps omitted
II. Install the Scrapy framework
pip install Scrapy
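A quick way to confirm the install worked, and to pull in pymongo in the same step since the MongoDB pipeline in section V will need it (both package names are as published on PyPI):

scrapy version        # prints the installed Scrapy version
pip install pymongo   # MongoDB driver used by the pipeline later on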
III. Create the Scrapy project:
1. Create the crawler project
scrapy startproject qidian
2. Create the spider; first cd into the project directory
cd qidian/
scrapy genspider book book.qidian.com
After the spider is created, the project directory looks like the following.
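(A typical layout for a project generated with the commands above; this is a sketch only, and the exact files may differ slightly between Scrapy versions.)

qidian/
├── scrapy.cfg
└── qidian/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book.py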
book.py under the spiders directory is our spider file.
IV. Open book.py and write the spider code
1. Open the catalog page of the book to be crawled, find the starting URL, and set start_urls:
# Catalog page of the book 鬼吹灯
start_urls = ['https://book.qidian.com/info/53269#Catalog']
2. When the project was created, the URL filter was set to:
allowed_domains = ['book.qidian.com']
Opening a chapter of the book shows that chapter URLs look like https://read.qidian.com/chapter/PNjTiyCikMo1/FzxWdm35gIE1, so read.qidian.com has to be added to allowed_domains as well:
allowed_domains = ['book.qidian.com', 'read.qidian.com']
All that remains is to pull the content we need out of the responses with XPath. The complete spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
import logging

logger = logging.getLogger(__name__)


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['book.qidian.com', 'read.qidian.com']
    start_urls = ['https://book.qidian.com/info/53269#Catalog']

    def parse(self, response):
        # Get the chapter list
        li_list = response.xpath('//div[@class="volume"][2]/ul/li')
        # Loop over the list, taking out each chapter name and its URL
        for li in li_list:
            item = {}
            # Chapter name
            item['chapter_name'] = li.xpath('./a/text()').extract_first()
            # Chapter URL
            item['chapter_url'] = li.xpath('./a/@href').extract_first()
            # The extracted URL looks like //read.qidian.com/chapter/PNjTiyCikMo1/TpiSLsyH5Hc1
            # and has to be rebuilt into an absolute URL
            if item['chapter_url'] is not None:
                item['chapter_url'] = 'https:' + item['chapter_url']
                # Request the content of each chapter
                # meta: pass the item along with the request
                yield scrapy.Request(item['chapter_url'],
                                     callback=self.parse_chapter,
                                     meta={'item': item})

    def parse_chapter(self, response):
        item = response.meta['item']
        # Get the chapter text
        item['chapter_content'] = response.xpath(
            '//div[@class="read-content j_readContent"]/p/text()').extract()
        yield item
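At this point the spider can already be run from the project root (the directory containing scrapy.cfg). The -o flag is just a convenient way to dump the scraped items to a file and inspect them before MongoDB is wired up in the next section; the output file name here is arbitrary:

scrapy crawl book -o chapters.jl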
V. Save the scraped data to MongoDB
1. Edit the settings.py file: find the ITEM_PIPELINES block and uncomment it:
ITEM_PIPELINES = {
    'qidian.pipelines.QidianPipeline': 300,
}
2. Add the MongoDB-related settings
# Host address
MONGODB_HOST = '127.0.0.1'
# Port
MONGODB_PORT = 27017
# Name of the database to save into
MONGODB_DBNAME = 'qidian'
# Name of the collection to save into
MONGODB_DOCNAME = 'dmbj'
3. Save the data in pipelines.py; the final file content is as follows
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
# scrapy.conf was removed in newer Scrapy releases; read the project
# settings through scrapy.utils.project instead
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class QidianPipeline(object):
    def __init__(self):
        '''Configure the MongoDB connection in __init__'''
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        # insert() is gone in pymongo 4+; insert_one() works on all recent versions
        self.post.insert_one(dict(item))
        return item
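Once the pipeline is in place, running scrapy crawl book again should write the chapters into MongoDB. A minimal sanity check with pymongo, assuming the local MongoDB instance and the settings above:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['qidian']['dmbj']

# Number of stored chapters and a sample document
print(collection.count_documents({}))
print(collection.find_one({}, {'chapter_name': 1, 'chapter_url': 1}))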