Python: a simple web scraper
A scraper for the mzitu.com photo site: it downloads only a single model's album.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com',
}

# Starting link: the album's first page
start_url = 'https://www.mzitu.com/161470'
start_html = requests.get(start_url, headers=headers)  # returns a Response object
# print(start_html.text)  # .text for text content; use .content for binary media

soup = BeautifulSoup(start_html.content, 'lxml')
# The second-to-last <span> in the page navigation holds the highest page number
max_span = soup.find('div', class_='pagenavi').find_all('span')[-2].get_text()

for page in range(1, int(max_span) + 1):
    # Appending the page number to the starting link gives that page's URL
    page_url = start_url + '/' + str(page)
    image_page = requests.get(page_url, headers=headers)
    # print(image_page.text)
    image_soup = BeautifulSoup(image_page.content, 'lxml')
    # Read the src attribute of the <img> tag: for <img src='lslsls'>, this returns 'lslsls'
    image_url = image_soup.find('div', class_='main-image').find('img')['src']
    name = str(image_url)  # don't forget the type conversion
    # print(name)
    img = requests.get(name, headers=headers)
    # Slice name from the seventh character from the end to use as the file name
    fpath = 'C:\\Users\\wztshine\\Desktop\\新建文件夹\\' + name[-7:]
    with open(fpath, 'wb') as f:
        print('output:', fpath)
        f.write(img.content)
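The two pieces of string handling in the loop, building per-page URLs from the album's base URL and slicing a local file name off the tail of the image URL, can be sketched in isolation without touching the network. Both URLs below are hypothetical stand-ins, and max_span is hard-coded instead of parsed from the page:

import requests
from bs4 import BeautifulSoup  # not used here; shown only to mirror the scraper's imports

# Pretend we already parsed the last page number from the pagination <span>s
base_url = 'https://www.mzitu.com/161470'  # hypothetical album URL
max_span = 5

# Same construction as the loop above: base URL + '/' + page number
page_urls = [base_url + '/' + str(page) for page in range(1, max_span + 1)]
print(page_urls[0])   # https://www.mzitu.com/161470/1

# Mirror name[-7:] from the scraper: keep the last seven characters
# of the image URL (e.g. '01a.jpg') as the local file name
image_url = 'https://i.example.com/2020/01a.jpg'  # hypothetical image address
filename = image_url[-7:]
print(filename)       # 01a.jpg

Note that the [-7:] slice only works when every image file name is exactly seven characters long; splitting on '/' and taking the last segment would be more robust.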