Web Scraping 04 / asyncio, selenium detection evasion, action chains, headless browsers

云之高水之远

2019-12-06



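1. Coroutines with asyncio

Coroutine basics
- Special function
  - A function whose definition is marked with the async keyword
  - What makes it special:
    Calling it returns a coroutine object
    The statements inside it do not run at the moment of the call
- Coroutine
  - An object returned by calling a special function. It represents a specific set of operations that has not run yet.
- Task object
  - A higher-level coroutine (a further wrapper around a coroutine object); it represents the same set of operations and adds state, a result, and callbacks
  - Binding a callback (typically used for parsing):
    task.add_done_callback(callback)
    The callback receives one argument, task: the task object the callback is bound to
    task.result(): returns the return value of the special function wrapped by the task
- Event loop object
  - Create an event loop object
  - Register the task object with the loop and start the loop
  - Purpose: the loop runs all task objects registered with it asynchronously
- Code example: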
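```
import asyncio
from time import sleep

# Special function
async def get_request(url):
    print('Downloading:', url)
    sleep(2)  # blocking call; acceptable here only because a single task runs
    print('Download finished:', url)

    return 'page_text'

# Callback definition (an ordinary function)
def parse(task):
    # The argument is the task object the callback is bound to
    print('i am callback!!!', task.result())

# Calling the special function returns a coroutine object
c = get_request('www.lbzhk.com')

# Create an event loop object
loop = asyncio.new_event_loop()
# Create a task object on that loop
task = asyncio.ensure_future(c, loop=loop)
# Bind a callback to the task object
task.add_done_callback(parse)

# Register the task object with the loop and start the loop
loop.run_until_complete(task)   # the loop runs one task
loop.close()
```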
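
The two properties of a special function listed above can be checked directly. The sketch below is only a minimal illustration; the names `fetch` and `log` are made up for this example and do not appear in the original notes.

```python
import asyncio

log = []  # records when the body of the special function actually runs

async def fetch(url):
    log.append(url)
    return 'page_text:' + url

# Calling the special function only builds a coroutine object
c = fetch('www.example.com')
is_coroutine = asyncio.iscoroutine(c)
ran_at_call_time = len(log) > 0

# The body runs only once an event loop drives the coroutine
result = asyncio.run(c)

print(is_coroutine, ran_at_call_time, result)
```

`asyncio.run` (Python 3.7+) creates a loop, runs the coroutine to completion, and closes the loop, so it can replace the manual loop handling when no callback is needed.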

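Multi-task coroutines

Suspend: give up the CPU so the event loop can run another task.
asyncio.wait(tasks): wraps a list of task objects into a single awaitable that finishes once all of them are done, so the loop can switch between them whenever one is suspended
await: used inside a special function, in front of a blocking operation; it suspends the current coroutine until that operation completes
Code example: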

import asyncio
import time

# Special function
async def get_request(url):
    print('Downloading:', url)
    await asyncio.sleep(2)   # non-blocking sleep: suspends this task only
    print('Download finished:', url)
    return 'i am page_text!!!'

def parse(task):
    page_text = task.result()
    print(page_text)

start = time.time()
urls = ['www.1.com', 'www.2.com', 'www.3.com']

loop = asyncio.new_event_loop()
tasks = []  # holds all the task objects (multi-task)
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c, loop=loop)
    task.add_done_callback(parse)
    tasks.append(task)

# asyncio.wait(tasks) completes once every task in the list is done
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

print('Total time:', time.time() - start)
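
On Python 3.7 and later, the same multi-task pattern can also be written with `asyncio.run` and `asyncio.gather`. This is a sketch of an alternative, not the approach used in these notes: the delay is shortened to 0.2 seconds, and `main` is a name introduced here.

```python
import asyncio
import time

async def get_request(url):
    await asyncio.sleep(0.2)   # stands in for a network request
    return 'page_text from ' + url

async def main(urls):
    # gather schedules every coroutine as a task and returns results in input order
    return await asyncio.gather(*(get_request(u) for u in urls))

urls = ['www.1.com', 'www.2.com', 'www.3.com']
start = time.time()
pages = asyncio.run(main(urls))
elapsed = time.time() - start

for page_text in pages:   # parsing happens here instead of in a callback
    print(page_text)
print('Total time:', elapsed)
```

Because the three tasks sleep concurrently, the total time stays close to one delay rather than the sum of three.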

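2. Multi-task asynchronous crawling with aiohttp

Conditions for asynchronous crawling
- Code from modules that do not support async must not appear inside a special function; a single blocking call stalls the event loop and cancels the asynchronous effect
- The requests module does not support async
- aiohttp is a network request module that supports async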
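
When a blocking library cannot be avoided, one common workaround is to hand the blocking call to a thread pool with `loop.run_in_executor`, which keeps the event loop free. The sketch below assumes a hypothetical `blocking_download` function that uses `time.sleep` in place of a real `requests.get` call; it is not part of the original notes.

```python
import asyncio
import time

def blocking_download(url):
    # Hypothetical stand-in for a blocking call such as requests.get(url).text
    time.sleep(0.2)
    return 'page_text from ' + url

async def get_request(url):
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor
    return await loop.run_in_executor(None, blocking_download, url)

async def main(urls):
    return await asyncio.gather(*(get_request(u) for u in urls))

urls = ['www.1.com', 'www.2.com', 'www.3.com']
start = time.time()
pages = asyncio.run(main(urls))
elapsed = time.time() - start
print(pages, elapsed)
```

This keeps the blocking calls off the event loop thread, at the cost of one worker thread per in-flight request; a natively asynchronous client such as aiohttp avoids that cost.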

Workflow for a multi-task asynchronous crawler with aiohttp

Installing the environment
```
pip install aiohttp
```

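Coding workflow:

Rough skeleton (not yet runnable):

with aiohttp.ClientSession() as s:
    # s.get(url, headers=..., params=..., proxy="http://ip:port")
    with s.get(url) as response:
        # response.read() returns bytes, like .content in requests
        page_text = response.text()
        return page_text

Filling in the details:

Put async in front of every with, because the session and the response are asynchronous context managers
Put await in front of every blocking operation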

async with aiohttp.ClientSession() as s:
    # s.get(url, headers=..., params=..., proxy="http://ip:port")
    async with s.get(url) as response:   # async with already awaits the request
        # response.read() returns bytes (.content in requests)
        page_text = await response.text()
        return page_text

Code example:

import asyncio
import aiohttp
import time
from bs4 import BeautifulSoup

# Collect every URL to request in one list
urls = ['http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom']
start = time.time()

async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url, headers=..., params=..., proxy="http://ip:port")
        async with s.get(url) as response:
            # response.read() returns bytes (.content in requests)
            page_text = await response.text()
            return page_text

def parse(task):
    page_text = task.result()
    soup = BeautifulSoup(page_text, 'lxml')
    data = soup.find('div', class_="tang").text
    print(data)

loop = asyncio.new_event_loop()
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c, loop=loop)
    task.add_done_callback(parse)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))
loop.close()

print('Total time:', time.time() - start)

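3. Using selenium

How selenium relates to crawling:
- Simulated login
- Convenient capture of dynamically loaded data
  Advantage: what you see is what you get
  Drawback: slow
selenium concept/installation
- Concept: a module for browser automation.
- Installing the environment: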
```
pip install selenium
```
Using selenium in practice
Prepare the driver for your browser: http://chromedriver.storage.googleapis.com/index.html

selenium demo program

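from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

# Path to your browser driver (Selenium 4 style; on Selenium 3 pass executable_path= instead).
# For a Windows path, use a raw string r'...' to prevent backslash escapes.
driver = webdriver.Chrome(service=Service(executable_path=r'chromedriver'))
# Open the Baidu home page
driver.get("http://www.baidu.com")
# Find the "设置" (Settings) link and click it
driver.find_elements(By.LINK_TEXT, '设置')[0].click()
sleep(2)
# In the settings menu, open "搜索设置" (Search settings) to show 50 results per page
driver.find_elements(By.LINK_TEXT, '搜索设置')[0].click()
sleep(2)

# Select 50 results per page
m = driver.find_element(By.ID, 'nr')
sleep(2)
# Relative xpath from the select element; same target as //*[@id="nr"]/option[3]
m.find_element(By.XPATH, './/option[3]').click()
sleep(2)

# Click "save settings"
driver.find_elements(By.CLASS_NAME, "prefpanelgo")[0].click()
sleep(2)

# Handle the alert that pops up: accept() confirms, dismiss() cancels
driver.switch_to.alert.accept()
sleep(2)
# Find the Baidu search box and type 美女
driver.find_element(By.ID, 'kw').send_keys('美女')
sleep(2)
# Click the search button
driver.find_element(By.ID, 'su').click()
sleep(2)
# On the results page, find the "美女_百度图片" link and open it
driver.find_elements(By.LINK_TEXT, '美女_百度图片')[0].click()
sleep(3)

# Close the browser
driver.quit()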

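selenium basic commands

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; on Selenium 3 use webdriver.Chrome(executable_path='./chromedriver.exe')
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

# Send a request:
bro.get(url)

# Locating elements
# Locate by xpath
search = bro.find_element(By.XPATH, '//input[@id="key"]')
# Locate by id
search = bro.find_element(By.ID, 'key')
# Locate by class name (find_elements returns a list of every match)
buttons = bro.find_elements(By.CLASS_NAME, 'prefpanelgo')

# Type text into the located element
search.send_keys('mac pro')
# Simulate a click
search.click()
# JS injection
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Handle an alert: accept() confirms, dismiss() cancels
bro.switch_to.alert.accept()
# switch_to.frame switches into the given child page (iframe)
bro.switch_to.frame('iframeResult')

# Capture the current page source
page_text = bro.page_source
# Save a screenshot of the current page
bro.save_screenshot('123.png')

# Close the browser
bro.quit()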

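A simple selenium example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

# Instantiate a browser object bound to the browser driver
# (Selenium 4 style; on Selenium 3 pass executable_path= instead)
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

# Send the request
url = 'https://www.jd.com/'
bro.get(url)
sleep(1)
# Locate the search box
# bro.find_element(By.XPATH, '//input[@id="key"]')
search = bro.find_element(By.ID, 'key')
search.send_keys('mac pro')   # type text into the located element
sleep(2)
btn = bro.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# JS injection: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

# Capture the current page source
page_text = bro.page_source
print(page_text)
sleep(3)

bro.quit()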

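Example: capturing dynamically loaded data

Crawl the company names on the first 3 pages of the drug administration site http://125.35.6.84:81/xk/

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from lxml import etree
from time import sleep

# Selenium 4 style; on Selenium 3 pass executable_path= instead
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
page_text = bro.page_source

all_page_text = [page_text]
# Click "next page" twice to reach pages 2 and 3
for i in range(2):
    # Locate the next-page element
    nextPage = bro.find_element(By.XPATH, '//*[@id="pageIto_next"]')
    # Click it
    nextPage.click()
    sleep(1)
    all_page_text.append(bro.page_source)

# Parse the captured pages
for page_text in all_page_text:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)

sleep(2)
bro.quit()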
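
The fixed one-second sleep after each click assumes the next page always loads within a second. Polling until a condition holds is usually more robust, and Selenium ships `WebDriverWait` for this purpose. The sketch below is a library-independent version of the same idea; `wait_until` and `page_changed` are hypothetical names introduced here, and the condition is a stand-in so the snippet runs without a browser.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)

# Example with a stand-in condition that becomes true on the third poll
calls = []
def page_changed():
    calls.append(1)
    return len(calls) >= 3

ready = wait_until(page_changed, timeout=2.0, interval=0.01)
print(ready, len(calls))
```

With a real browser object, the condition could compare `bro.page_source` against the source captured before the click.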

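4. Action chains

Action chain concept/workflow
- ActionChains: a sequence of user actions
  The action chain object action is a separate object from the browser object bro
- Workflow:
  1. Instantiate an action chain object, binding it to the browser it will drive
  2. Queue the consecutive actions
  3. perform() immediately executes the actions queued on the chain

Example code: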

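from selenium import webdriver
from selenium.webdriver import ActionChains  # action chains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

# Selenium 4 style; on Selenium 3 pass executable_path= instead
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'

bro.get(url)
# NoSuchElementException is raised when the target element lives inside an iframe
# Fix: use switch_to.frame to switch into that child page first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element(By.XPATH, '//*[@id="draggable"]')

# Instantiate an action chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)   # press and hold the element

# perform() makes the chain execute immediately
for i in range(5):
    action.move_by_offset(xoffset=15, yoffset=15).perform()
    sleep(2)
action.release().perform()   # release() is only queued until perform() is called
sleep(5)
bro.quit()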

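5. Analysis of a simulated 12306 login

Simulated login workflow:
1. Save a screenshot of the current browser page
2. Crop the region that contains the captcha
  - Get the element's position on the page
  - Work out the rectangle to crop
  - Crop that region with the Image tool
3. Call a captcha-solving platform to recognize the captcha; it returns the coordinates to click

Code example: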

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep
from PIL import Image  # install Pillow (the maintained fork of PIL)
from CJY import Chaojiying_Client

# Wrap captcha recognition in a function
def transformCode(imgPath, imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']


# Selenium 4 style; on Selenium 3 pass executable_path= instead
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(2)
# Save a screenshot of the current browser page
bro.save_screenshot('./main.png')
# Crop the region that contains the captcha
# Get the element's position on the page
img_tag = bro.find_element(By.XPATH, '//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = img_tag.location   # coordinates of the element's top-left corner
size = img_tag.size   # width and height of the element
# Rectangle to crop: (left, upper, right, lower)
rectangle = (int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
# Crop that region with the Image tool
image = Image.open('./main.png')
frame = image.crop(rectangle)   # crop cuts out the given rectangle
frame.save('code.png')

# Call the captcha-solving platform
result = transformCode('./code.png', 9004)
print(result)  # x1,y1|x2,y2|x3,y3

# x1,y1|x2,y2|x3,y3 ==> [[x1,y1],[x2,y2],[x3,y3]]; a single point has no '|'
all_list = [[int(v) for v in pair.split(',')] for pair in result.split('|')]

for x, y in all_list:
    # Recent Selenium 4 releases measure the offset from the element's center;
    # Selenium 3 measures from its top-left corner, so pass x, y directly there
    x_offset = x - int(size['width'] / 2)
    y_offset = y - int(size['height'] / 2)
    ActionChains(bro).move_to_element_with_offset(img_tag, x_offset, y_offset).click().perform()
    sleep(1)


bro.find_element(By.ID, 'username').send_keys('xxxxxx')
sleep(1)
bro.find_element(By.ID, 'password').send_keys('xxxx')
sleep(1)

bro.find_element(By.ID, 'loginSub').click()

sleep(10)
print(bro.page_source)
bro.quit()
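
The coordinate-parsing step can be isolated into a small function and checked without a browser. This is a sketch that assumes the platform returns the `x1,y1|x2,y2|x3,y3` string format noted in the code comments; `parse_points` is a name introduced here, and the sample coordinates are made up.

```python
def parse_points(result):
    """Turn 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    points = []
    for pair in result.split('|'):
        x, y = pair.split(',')
        points.append([int(x), int(y)])
    return points

print(parse_points('110,72|253,80'))
print(parse_points('45,60'))
```

Splitting on `|` first covers both cases, since a string without `|` splits into a single pair.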

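6. Evading selenium detection

Testing whether a server can detect selenium
1. Open a site in a normal browser and evaluate window.navigator.webdriver in the JS console: the value is undefined (newer browsers may report false)
2. On a page opened through selenium, the same expression evaluates to true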
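
The check can also be scripted. In the sketch below, `reports_webdriver` is a hypothetical helper that accepts any object with an `execute_script` method, as Selenium's browser object has, and `FakeDriver` is a stand-in so the snippet runs without a browser.

```python
def reports_webdriver(driver):
    """Return True when the page sees navigator.webdriver as true."""
    value = driver.execute_script('return window.navigator.webdriver')
    return value is True

class FakeDriver:
    """Stand-in for a browser object; returns a canned value for any script."""
    def __init__(self, value):
        self.value = value

    def execute_script(self, script):
        return self.value

print(reports_webdriver(FakeDriver(True)))    # page opened through selenium
print(reports_webdriver(FakeDriver(None)))    # undefined comes back as None
```

With a real browser object, passing `bro` to the helper after `bro.get(url)` shows whether the evasion option took effect.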

Example code for evading detection:

# Evade detection
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

# Selenium 4 style; on Selenium 3 pass executable_path= instead
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'), options=option)

url = 'https://www.taobao.com/'

bro.get(url)

7. Headless browsers

Available headless browsers
- PhantomJS
- Headless Chrome

Headless browser example:

# Headless browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from time import sleep

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Selenium 4 style; on Selenium 3 pass executable_path= instead
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'), options=chrome_options)
url = 'https://www.taobao.com/'
bro.get(url)
sleep(2)
bro.save_screenshot('123.png')

print(bro.page_source)
bro.quit()

Summary:

Network request modules: requests/urllib/aiohttp
Differences between aiohttp and requests:
- Proxies: requests takes proxies, aiohttp takes proxy
- Binary content: requests uses response.content, aiohttp uses response.read()
