Using IronPython for More Flexible Web Scraping
We often need to collect information from other websites, and the relevant .NET techniques are all mature: WebRequest fetches pages, supports a custom Referer header, and handles cookies, while pages are usually parsed with regular expressions. The problem is that whenever the target site's structure changes, the code has to be modified, recompiled, and redeployed. With IronPython, the fetching and parsing logic can be written as Python scripts: if the target page's structure changes, you only edit the script, with no recompilation. This way C# handles the interaction and UI, while Python encapsulates the parts that are expected to change frequently.
How to implement the fetching and parsing logic as a Python script with IronPython
After installing IronPython and Visual Studio 2010, you also need to download SgmlReader (see the reference links). This component converts HTML that is not strictly well-formed into well-formed XML, and can even add DTD validation.
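Running SgmlReader itself requires .NET, but the kind of normalization it performs can be illustrated in plain CPython: tag names are lower-cased and implicitly closed elements get explicit closing tags, so that a strict XML parser accepts the result. The before/after strings below are a hand-written illustration of that idea, not SgmlReader's literal output.

```python
import xml.etree.ElementTree as ET

# Loose HTML as found in the wild: mixed-case tags, unclosed <p> and <br>.
loose = "<HTML><Body><p>first<p>second<br></Body></HTML>"

# What a cleanup pass in the spirit of SgmlReader would produce:
# lower-cased tag names, every element explicitly closed.
clean = "<html><body><p>first</p><p>second<br/></p></body></html>"

# A strict XML parser rejects the loose form...
try:
    ET.fromstring(loose)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False: the loose form is not well-formed XML

# ...but happily parses the cleaned-up form, so XPath-style
# queries become possible.
doc = ET.fromstring(clean)
print([p.text for p in doc.iter("p")])  # ['first', 'second']
```

This is exactly why the article converts pages to XML first: once the markup is well-formed, structured queries replace brittle regular expressions.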
Let's take scraping a Baidu Tieba page as an example. Create a new Console project and reference the IronPython, Microsoft.Dynamic, Microsoft.Scripting, and SgmlReaderDll assemblies. Copy Html.dtd from the SgmlReader package into the project directory; without it, SgmlReader will try to fetch the DTD over the network based on the doctype. Then create a file named baidu.py, and finally add the following to the project's build events so that both files are copied to the output directory:
copy $(ProjectDir)\*.py $(TargetDir)
copy $(ProjectDir)\*.dtd $(TargetDir)
import clr, sys
clr.AddReference("SgmlReaderDll")
clr.AddReference("System.Xml")
from Sgml import *
from System.Net import *
from System.IO import TextReader, StreamReader
from System.Xml import *
from System.Text import Encoding

def fromHtml(textReader):
    # Run the raw HTML through SgmlReader so it comes out as well-formed XML.
    sgmlReader = SgmlReader()
    sgmlReader.SystemLiteral = "html.dtd"
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All
    sgmlReader.CaseFolding = CaseFolding.ToLower
    sgmlReader.InputStream = textReader
    doc = XmlDocument()
    doc.PreserveWhitespace = True
    doc.XmlResolver = None          # never resolve DTDs over the network
    doc.Load(sgmlReader)
    return doc

def getWebData(url, method, data=None, cookie=None, encoding="UTF-8"):
    # Fetch a page, optionally sending a request body and reusing cookies,
    # and return a StreamReader decoded with the given encoding.
    req = WebRequest.Create(url)
    req.Method = method
    if cookie is not None:
        req.CookieContainer = cookie
    if data is not None:
        stream = req.GetRequestStream()
        stream.Write(data, 0, data.Length)
    rsp = req.GetResponse()
    reader = StreamReader(rsp.GetResponseStream(), Encoding.GetEncoding(encoding))
    return reader
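getWebData depends on System.Net, so it only runs under IronPython. For comparison, here is a sketch of the same GET/POST-with-cookies pattern in CPython's standard library, with urllib.request and http.cookiejar standing in for WebRequest and CookieContainer (the URL below is a placeholder):

```python
import urllib.request
import http.cookiejar

def get_web_data(url, method, data=None, jar=None, encoding="UTF-8"):
    """Rough CPython counterpart of getWebData: one opener per cookie jar,
    POST body passed as bytes, response decoded with the given encoding."""
    jar = jar if jar is not None else http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))  # cookies persist across calls
    req = urllib.request.Request(url, data=data, method=method)
    with opener.open(req) as rsp:
        return rsp.read().decode(encoding)

# Building the request object is testable offline: no network I/O
# happens until opener.open() is called.
req = urllib.request.Request("http://example.com/f", data=b"kw=python",
                             method="POST")
print(req.get_method())  # POST
```

As in the .NET version, passing the same cookie jar to every call keeps the session alive across requests.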
class Post:
    def __init__(self, hit, comments, title, link, author):
        self.hit = hit
        self.comments = comments
        self.title = title
        self.link = link
        self.author = author
class BaiDu:
    def __init__(self, encoding):
        self.cc = CookieContainer()    # one cookie container for the whole session
        self.encoding = encoding
        self.posts = []
    def getPosts(self, url):
        reader = getWebData(url, "GET", None, self.cc, self.encoding)
        doc = fromHtml(reader)
        # This XPath is the part most likely to change when the page layout does.
        trs = doc.SelectNodes("html//table[@id='thread_list_table']/tbody/tr")
        self.parsePosts(trs)

    def parsePosts(self, trs):
        for tr in trs:
            tds = tr.SelectNodes("td")
            hit = tds[0].InnerText
            comments = tds[1].InnerText
            title = tds[2].ChildNodes[1].InnerText
            link = tds[2].ChildNodes[1].Attributes["href"].Value
            author = tds[3].InnerText
            post = Post(hit, comments, title, link, author)
            self.posts.append(post)
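The lookup inside getPosts is exactly the piece you would edit when Baidu redesigns the page. The sketch below replays the same extraction in CPython's ElementTree against a tiny hand-written table in the shape parsePosts assumes; the real code runs SelectNodes over the SgmlReader output, and the sample row here is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the cleaned-up page: one thread row in the
# layout parsePosts expects (hit, comments, <a>title</a>, author).
sample = """
<html><body>
  <table id="thread_list_table"><tbody>
    <tr>
      <td>1024</td>
      <td>15</td>
      <td> <a href="/p/123">hello tieba</a></td>
      <td>someone</td>
    </tr>
  </tbody></table>
</body></html>
"""

doc = ET.fromstring(sample)
posts = []
# Same idea as SelectNodes("html//table[@id='thread_list_table']/tbody/tr"):
for tr in doc.findall(".//table[@id='thread_list_table']/tbody/tr"):
    tds = tr.findall("td")
    a = tds[2].find("a")          # the anchor that ChildNodes[1] picks out in .NET
    posts.append({
        "hit": tds[0].text,
        "comments": tds[1].text,
        "title": a.text,
        "link": a.get("href"),
        "author": tds[3].text,
    })
print(posts[0]["title"], posts[0]["link"])  # hello tieba /p/123
```

If the site moves the table or renames its id, only the query string changes, which is the whole argument for keeping this logic in an editable script rather than compiled C#.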