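[Python 3 Crawler Tutorial] Crawling a Baidu story data-analysis site
Aimed at readers who have only a little Python 3 and web background; experts and passers-by can skip this one.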
Environment
- Windows 10 1803, 64-bit
- Chrome 68.0.3440.106 (official build) (64-bit)
- PyCharm Professional 2018.2
- Python 3.6.5
Libraries (install anything not bundled with Python using pip):
- pymysql: import pymysql
- requests: import requests
- json (bundled): import json
- Faker: from faker import Faker
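First, pick a target
The target site is this one: http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052
1. First, capture the POST/GET address
Open the home page and click "点击领取9月牛股" ("Click to claim September's top stocks"). When the dialog pops up, press F12 to open the developer tools.
In the developer tools select "Network", then click the claim button on the page. A new entry appears in the Network panel.
Then copy the data we need into PyCharm and arrange it as JSON-style dictionaries:
2. This gives us the following data:
Send one record to the site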
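When replaying a captured request, it is worth checking that the body encoding matches the Content-Type header seen in the Network panel. The sketch below uses only the standard library to show how the same dictionary looks as a form-encoded body and as a JSON body. The field names are borrowed from the form; the values are placeholders, not captured data.

```python
import json
from urllib.parse import urlencode, parse_qs

# Placeholder payload: field names follow the form, values are made up.
payload = {"id": "20110052", "phone": "13800000000", "qudao": 98}

# What "application/x-www-form-urlencoded" looks like on the wire
form_body = urlencode(payload)
# What "application/json" looks like on the wire
json_body = json.dumps(payload)

print(form_body)
print(json_body)

# A form parser recovers the fields from the form body...
form_fields = parse_qs(form_body)
# ...but finds no key=value pairs in the JSON string.
json_fields = parse_qs(json_body)
print(form_fields["phone"], json_fields)
```

With requests, passing a dict as data= produces the form-encoded variant, while json= (or data=json.dumps(...)) produces the JSON variant, so the choice should follow whatever the captured request used.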
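But surely we are not going to send it only once.
So let me introduce the "fakest" library in Python: Faker.
Its ability to fabricate data is surprisingly powerful; take a look if you are interested.
With that, our code becomes the following: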
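We are almost done. To make it easy to send data in a loop, let's wrap the code in a function:
When I first learned Python I really disliked writing functions. The job could be done in a couple of lines, so wrapping them in a function felt like padding the line count for no benefit. Then today I came across a story:
To detect empty milk cartons, a postdoc and a farmer solved the problem in two different ways: one invented a machine, the other used a fan.
Many problems we run into while learning something new can be solved with methods we already know. As the problems get deeper, though, sometimes only the newly learned knowledge will do. Writing functions (or wrapping code in classes) is never a mistake, provided the code is for practice rather than for an emergency.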
    import requests
    from faker import Faker

    f = Faker(locale="zh-CN")


    def duang():
        # Generate a fresh fake User-Agent and phone number for every call
        user_agent = f.user_agent()
        phone = f.phone_number()
        url = "https://download.zslxt.com/tinterface.php"
        # Content-Length is left out: requests computes it from the body
        headers = {
            "Host": "download.zslxt.com",
            "User-Agent": user_agent,
            "Accept": "*/*",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Origin": "http://gpyd.gp241.com",
            "Connection": "keep-alive"
        }
        data = {
            "bm": "gbk",
            "gpdm": "",
            "id": "20110052",
            "phone": phone,
            "qudao": 98,
            "remarks": "牛有圈百度2)"
        }
        # The Content-Type is form-urlencoded, so pass the dict itself
        # rather than a JSON string
        req = requests.post(url=url, headers=headers, data=data)
        return user_agent, phone, req
Now it is easy to call. Write a main block that calls it:
    import requests
    from faker import Faker

    f = Faker(locale="zh-CN")


    def duang():
        # Generate a fresh fake User-Agent and phone number for every call
        user_agent = f.user_agent()
        phone = f.phone_number()
        url = "https://download.zslxt.com/tinterface.php"
        # Content-Length is left out: requests computes it from the body
        headers = {
            "Host": "download.zslxt.com",
            "User-Agent": user_agent,
            "Accept": "*/*",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Origin": "http://gpyd.gp241.com",
            "Connection": "keep-alive"
        }
        data = {
            "bm": "gbk",
            "gpdm": "",
            "id": "20110052",
            "phone": phone,
            "qudao": 98,
            "remarks": "牛有圈百度2)"
        }
        # The Content-Type is form-urlencoded, so pass the dict itself
        # rather than a JSON string
        req = requests.post(url=url, headers=headers, data=data)
        return user_agent, phone, req


    if __name__ == '__main__':
        for i in range(100000):
            user_agent, phone, req = duang()
            # Print the loop index, phone number, HTTP status and User-Agent
            print(i, ' ', phone, ' ', req.status_code, ' ', user_agent)
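This simply prints some information. The network dropped while I was out for a meal, so it only ran a little over 3,000 iterations, and I won't post a screenshot here. (If you are using this for practice, try improving the code so that it waits after a network drop and then carries on.)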
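One possible way to approach that exercise is a small retry wrapper with a growing delay. This is only a sketch under assumptions: call_with_retry, its parameters, and the flaky stand-in function are hypothetical names, not part of the original script. It catches OSError, which is the base class of Python's built-in ConnectionError and also of the exceptions that requests raises.

```python
import time


def call_with_retry(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action(); on OSError wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Demo with a stand-in function that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network down")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

In the script itself, the loop body could be passed in as action, so a dropped connection pauses the loop instead of ending it.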
Here is the full code (now writing to MySQL):
    import pymysql
    import requests
    from faker import Faker

    f = Faker(locale="zh-CN")


    def duang():
        # Generate a fresh fake User-Agent and phone number for every call
        user_agent = f.user_agent()
        phone = f.phone_number()
        url = "https://download.zslxt.com/tinterface.php"
        # Content-Length is left out: requests computes it from the body
        headers = {
            "Host": "download.zslxt.com",
            "User-Agent": user_agent,
            "Accept": "*/*",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Origin": "http://gpyd.gp241.com",
            "Connection": "keep-alive"
        }
        data = {
            "bm": "gbk",
            "gpdm": "",
            "id": "20110052",
            "phone": phone,
            "qudao": 98,
            "remarks": "牛有圈百度2)"
        }
        # The Content-Type is form-urlencoded, so pass the dict itself
        # rather than a JSON string
        req = requests.post(url=url, headers=headers, data=data)
        return user_agent, phone, req.status_code


    if __name__ == '__main__':
        # Open one connection up front instead of reconnecting on every iteration
        db = pymysql.connect(host="localhost", user="root",
                             password="xiaoyan", database="python")
        cur = db.cursor()
        try:
            for i in range(100000):
                user_agent, phone, status_code = duang()
                # Parameterized query: the driver escapes the values for us
                cur.execute(
                    "INSERT INTO python1duang VALUES (default, %s, %s, %s)",
                    (user_agent, phone, str(status_code)),
                )
                db.commit()
                print(i, ' ', phone, ' ', status_code, ' ', user_agent)
        finally:
            # Close the connection even if the loop is interrupted
            db.close()
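When writing generated strings to a database, placeholders are generally safer than building the SQL text by hand, because a value that contains a quote character cannot break the statement. The sketch below shows the idea with the standard-library sqlite3 module as a stand-in for MySQL, so it runs without a server. The table layout is an assumption made for illustration; sqlite3 uses ? as its placeholder, whereas PyMySQL uses %s.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL server
db = sqlite3.connect(":memory:")
# Assumed table layout, for illustration only
db.execute(
    "CREATE TABLE python1duang ("
    "id INTEGER PRIMARY KEY, user_agent TEXT, phone TEXT, status_code TEXT)"
)

# A value containing a single quote would break a hand-built SQL string
tricky_agent = "Mozilla/5.0 (it's a test)"

# Values are passed separately from the SQL text, so no manual quoting is needed
db.execute(
    "INSERT INTO python1duang (user_agent, phone, status_code) VALUES (?, ?, ?)",
    (tricky_agent, "13800000000", "200"),
)
db.commit()

row = db.execute(
    "SELECT user_agent, phone, status_code FROM python1duang"
).fetchone()
print(row)
db.close()
```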