爬虫之路: 字体文件反爬二(动态字体文件)
上一篇解决了但页面的字体反爬, 这篇记录下如何解决动态字体文件, 编码不同, 文字顺序不同的情况
源码在最后
冷静分析页面
打开一个页面, 发现字体文件地址是动态的, 这个倒是好说, 写个正则, 就可以动态匹配出来
先下载下来一个新页面的字体文件, 做一下对比, 如图
mmp, 发现编码, 字体顺序那那都不一样, 这可就过分了, 心里一万个xxx在奔腾
头脑风暴ing.gif
(与伙伴对话ing...)
不着急, 还是要冷静下来, 再想想哪里还有突破点
同一个页面的字体文件地址是动态的, 但是, 里面的字体编码和顺序是不会变的呀
可以使用某一个页面的字体文件做一个标准的字体映射表呀!
好像发现了新世界的大门, 可门还没开开, 就被自己堵死了, 就想 做出来映射表然后呢!(又要奔腾了)
想呀想呀想呀想, 最后叫上小伙伴一起想
突然就想到了, 虽然那么多不一样, 但是, 但是, 相同文字的坐标点相同呀! 突然又打开了大门
首先排除特别的文字的情况下, 只是在这个字体文件的情况下, 60%的字坐标点一样
那剩下的怎么办呢! 先不管了, 先把这60%给弄出来
提取60%的字体映射表
制作标准编码文字映射表(某页字体文件为准)
def extract_ttf_file(self, file_name, get_word_map=True): _font = TTFont(file_name) uni_list = _font.getGlyphOrder()[1:] # 被替换的字体的列表 word_list = [ "坏", "少", "远", "大", "九", "左", "近", "呢", "十", "高", "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短", "很", "一", "了", "地", "好", "多", "七", "不", "长", "低", "三", "五", "六", "下", "更", "和", "四", "上" ] utf_word_map = {} utf_coordinates_map = {} for index, uni_code in enumerate(uni_list): utf_word_map[uni_code] = word_list[index] utf_coordinates_map[uni_code] = list(_font[‘glyf‘][uni_code].coordinates) if get_word_map: return utf_word_map, utf_coordinates_map return utf_coordinates_map # self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name)
制作新标准编码映射表
下载要破解的字体文件, 并替换标准编码字体映射表
def replace_ttf_map(self): unicode_mlist_map = [] new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False) for local_unicode, local_coordinate in self.local_utf_coordinates_map.items(): coordinate_equal_list = [] for new_unicode, new_coordinate in new_utf_coordinates_map.items(): if len(new_coordinate) == len(local_coordinate): coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}) if len(coordinate_equal_list) == 1: unicode_mlist_map.append(coordinate_equal_list[0])for unicode_dict in unicode_mlist_map: self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]] print("new unicode map extract success\n", self.new_unicode_map)
会得到22个字体的映射表, 共38个:
{‘uniED23‘: ‘坏‘, ‘uniED75‘: ‘少‘, ‘uniEC18‘: ‘九‘, ‘uniECB0‘: ‘左‘, ‘uniECF7‘: ‘呢‘, ‘uniEC7A‘: ‘高‘, ‘uniECA7‘: ‘矮‘, ‘uniEDD6‘: ‘八‘, ‘uniEDA0‘: ‘二‘, ‘uniEDF2‘: ‘右‘, ‘uniEC96‘: ‘的‘, ‘uniEDF0‘: ‘小‘, ‘uniEC8B‘: ‘短‘, ‘uniECDB‘: ‘一‘, ‘uniEDD5‘: ‘地‘, ‘uniED8F‘: ‘好‘, ‘uniECC1‘: ‘多‘, ‘uniEDBA‘: ‘七‘, ‘uniEC2A‘: ‘长‘, ‘uniED59‘: ‘下‘, ‘uniEC94‘: ‘和‘, ‘uniED73‘: ‘四‘}
替换40%的字体映射表
重组新标准映射表
接下来, 就用坐标点来解决, 以下为思路
使用两点坐标差来判断, 但是这个偏差值拿不准
相同文字, 坐标点几乎一致, 即所有坐标点相差的绝对值的和最小的就为同一个字
来先试试
def get_distence(self, norm_coordinate, new_coordinate): distance_total = 0 for index, coordinate_point in enumerate(norm_coordinate): distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1]) return distance_total
然后在重组标准编码, 标准坐标, 新的编码, 和新坐标
(这是想, 找出最相近的坐标, 使用新坐标提取出标准编码, 然后用标准编码提取对应的文字, 在替换成使用本页用的编码映射表)
# 准备替换的编码坐标映射表 {"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}
提取所有坐标点加起来最小的元素
def handle_subtraction(self, coordinate_equal_list): coordinate_min_list = [] for coordinate_equal in coordinate_equal_list: n = self.get_distence(coordinate_equal.get(‘norm_coordinate‘), coordinate_equal.get(‘new_coordinate‘)) coordinate_min_list.append(n) return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))]
替换, 生成新的标准映射表
self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]
加入判断
在以上替换60%的字体映射表再加入一个判断, 改成如下
def replace_ttf_map(self): unicode_mlist_map = [] new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False) for local_unicode, local_coordinate in self.local_utf_coordinates_map.items(): coordinate_equal_list = [] for new_unicode, new_coordinate in new_utf_coordinates_map.items(): if len(new_coordinate) == len(local_coordinate): coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}) if len(coordinate_equal_list) == 1: unicode_mlist_map.append(coordinate_equal_list[0]) elif len(coordinate_equal_list) > 1: min_word = self.handle_subtraction(coordinate_equal_list) unicode_mlist_map.append(min_word) for unicode_dict in unicode_mlist_map: self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]] print("new unicode map extract success\n", self.new_unicode_map)
输出一个标准的坐标值, 这里我就不上图进行对比了, 经过对比, 发现没什么问题
{‘uniED23‘: ‘坏‘, ‘uniED75‘: ‘少‘, ‘uniEDC5‘: ‘远‘, ‘uniED5A‘: ‘大‘, ‘uniEC18‘: ‘九‘, ‘uniECB0‘: ‘左‘, ‘uniECA5‘: ‘近‘, ‘uniECF7‘: ‘呢‘, ‘uniED3F‘: ‘十‘, ‘uniEC7A‘: ‘高‘, ‘uniEC44‘: ‘着‘, ‘uniECA7‘: ‘矮‘, ‘uniEDD6‘: ‘八‘, ‘uniEDA0‘: ‘二‘, ‘uniEDF2‘: ‘右‘, ‘uniED09‘: ‘是‘, ‘uniEC32‘: ‘得‘, ‘uniEC96‘: ‘的‘, ‘uniEDF0‘: ‘小‘, ‘uniEC8B‘: ‘短‘, ‘uniED3D‘: ‘很‘, ‘uniECDB‘: ‘一‘, ‘uniEC60‘: ‘了‘, ‘uniEDD5‘: ‘地‘, ‘uniED8F‘: ‘好‘, ‘uniECC1‘: ‘多‘, ‘uniEDBA‘: ‘七‘, ‘uniED2D‘: ‘不‘, ‘uniEC2A‘: ‘长‘, ‘uniED11‘: ‘低‘, ‘uniEC5E‘: ‘三‘, ‘uniECDD‘: ‘五‘, ‘uniEDBC‘: ‘六‘, ‘uniED59‘: ‘下‘, ‘uniEE02‘: ‘更‘, ‘uniEC94‘: ‘和‘, ‘uniED73‘: ‘四‘, ‘uniED6A‘: ‘上‘}
补充
如有以上有错误, 恳求大神指出
其余的就和上篇的一致了, 点击这里查看
源码
# -*- coding: utf-8 -*- # @Author: Mehaei # @Date: 2020-01-10 14:51:53 # @Last Modified by: Mehaei # @Last Modified time: 2020-01-13 10:10:13 import re import os import requests from lxml import etree from fontTools.ttLib import TTFont class NotFoundFontFileUrl(Exception): pass class CarHomeFont(object): def __init__(self, url, *args, **kwargs): self.local_ttf_name = "norm_font.ttf" self.download_ttf_name = ‘new_font.ttf‘ self.new_unicode_map = {} self._making_local_code_map() self._download_ttf_file(url, self.download_ttf_name) def _download_ttf_file(self, url, file_name): self.page_html = self.download(url) or "" # 获取字体的连接文件 font_file_name = (re.findall(r",url\(‘(//.*\.ttf)?‘\) format", self.page_html) or [""])[0] if not font_file_name: raise NotFoundFontFileUrl("not found font file name") # 下载字体文件 file_content = self.download("https:%s" % font_file_name, content=True) # 讲字体文件保存到本地 with open(file_name, ‘wb‘) as f: f.write(file_content) print("font file download success") def _making_local_code_map(self): if not os.path.exists(self.local_ttf_name): # 这个url为标准字体文件地址, 如要更改, 请手动更改字体列表 url = "https://club.autohome.com.cn/bbs/thread/62c48ae0f0ae73ef/75904283-1.html" self._download_ttf_file(url, self.local_ttf_name) self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name) print("local ttf load done") def get_distence(self, norm_coordinate, new_coordinate): distance_total = 0 for index, coordinate_point in enumerate(norm_coordinate): distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1]) return distance_total def handle_subtraction(self, coordinate_equal_list): coordinate_min_list = [] for coordinate_equal in coordinate_equal_list: n = self.get_distence(coordinate_equal.get(‘norm_coordinate‘), coordinate_equal.get(‘new_coordinate‘)) coordinate_min_list.append(n) return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))] def replace_ttf_map(self): unicode_mlist_map = [] new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False) for local_unicode, local_coordinate in self.local_utf_coordinates_map.items(): coordinate_equal_list = [] for new_unicode, new_coordinate in new_utf_coordinates_map.items(): if len(new_coordinate) == len(local_coordinate): coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}) if len(coordinate_equal_list) == 1: unicode_mlist_map.append(coordinate_equal_list[0]) elif len(coordinate_equal_list) > 1: min_word = self.handle_subtraction(coordinate_equal_list) unicode_mlist_map.append(min_word) for unicode_dict in unicode_mlist_map: self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]] print("new unicode map extract success\n", self.new_unicode_map) def extract_ttf_file(self, file_name, get_word_map=True): _font = TTFont(file_name) uni_list = _font.getGlyphOrder()[1:] # 被替换的字体的列表 word_list = [ "坏", "少", "远", "大", "九", "左", "近", "呢", "十", "高", "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短", "很", "一", "了", "地", "好", "多", "七", "不", "长", "低", "三", "五", "六", "下", "更", "和", "四", "上" ] utf_word_map = {} utf_coordinates_map = {} for index, uni_code in enumerate(uni_list): utf_word_map[uni_code] = word_list[index] utf_coordinates_map[uni_code] = list(_font[‘glyf‘][uni_code].coordinates) if get_word_map: return utf_word_map, utf_coordinates_map return utf_coordinates_map def repalce_source_code(self): replaced_html = self.page_html for utf_code, word in self.new_unicode_map.items(): replaced_html = replaced_html.replace("&#x%s;" % utf_code[3:].lower(), word) return replaced_html def get_subject_content(self): normal_html = self.repalce_source_code() # 使用xpath 获取 主贴 xp_html = etree.HTML(normal_html) subject_text = ‘‘.join(xp_html.xpath(‘//div[@xname="content"]//div[@class="tz-paragraph"]//text()‘)) return subject_text def download(self, url, *args, try_time=5, method="GET", content=False, **kwargs): kwargs.setdefault("headers", {}) kwargs["headers"].update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}) while try_time: try: response = requests.request(method.upper(), url, *args, **kwargs) if response.ok: if content: return response.content return response.text else: continue except Exception as e: try_time -= 1 print("download error: %s" % e) if __name__ == "__main__": url = "https://club.autohome.com.cn/bbs/thread/34d6bcc159b717a9/85794510-1.html#pvareaid=6830286" car = CarHomeFont(url) car.replace_ttf_map() text = car.get_subject_content() print(text)