Python转换HTML到Text纯文本的方法
本文实例讲述了Python转换HTML到Text纯文本的方法。分享给大家供大家参考。具体分析如下:
今天项目需要将HTML转换为纯文本,去网上搜了一下,发现Python果然是神通广大,无所不能,方法是五花八门。
拿今天亲自试的两个方法举例,以方便后人:
方法一:
1. 安装nltk,可以去pipy装
(注:需要依赖以下包:numpy, PyYAML)
2.测试代码:
代码如下:
>>> import nltk >>> aa = r''''' <html> <body> <b>Project:</b> DeHTML<br> <b>Description</b>:<br> This small script is intended to allow conversion from HTML markup to plain text. </body> </html> ''' >>> aa '\n<html>\n <body>\n <b>Project:</b> DeHTML<br>\n <b>Description</b>:<br>\n This small script is intended to allow conversion from HTML markup to \n plain text.\n </body>\n </html>\n ' >>> <strong>print nltk.clean_html(aa)</strong> Project: DeHTML Description : This small script is intended to allow conversion from HTML markup to plain text.
方法二:
如果觉得nltk太笨重,大材小用的话,可以自己写代码,代码如下:
代码如下:
from HTMLParser import HTMLParser from re import sub from sys import stderr from traceback import print_exc class _DeHTMLParser(HTMLParser): def __init__(self): HTMLParser.__init__(self) self.__text = [] def handle_data(self, data): text = data.strip() if len(text) > 0: text = sub('[ \t\r\n]+', ' ', text) self.__text.append(text + ' ') def handle_starttag(self, tag, attrs): if tag == 'p': self.__text.append('\n\n') elif tag == 'br': self.__text.append('\n') def handle_startendtag(self, tag, attrs): if tag == 'br': self.__text.append('\n\n') def text(self): return ''.join(self.__text).strip() def dehtml(text): try: parser = _DeHTMLParser() parser.feed(text) parser.close() return parser.text() except: print_exc(file=stderr) return text def main(): text = r''''' <html> <body> <b>Project:</b> DeHTML<br> <b>Description</b>:<br> This small script is intended to allow conversion from HTML markup to plain text. </body> </html> ''' print(dehtml(text)) if __name__ == '__main__': main()
运行结果:
>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description :
This small script is intended to allow conversion from HTML markup to plain text.
希望本文所述对大家的Python程序设计有所帮助。
相关推荐
YENCSDN 2020-11-17
lsjweiyi 2020-11-17
houmenghu 2020-11-17
Erick 2020-11-17
HeyShHeyou 2020-11-17
以梦为马不负韶华 2020-10-20
lhtzbj 2020-11-17
夜斗不是神 2020-11-17
pythonjw 2020-11-17
dingwun 2020-11-16
lhxxhl 2020-11-16
坚持是一种品质 2020-11-16
染血白衣 2020-11-16
huavhuahua 2020-11-20
meylovezn 2020-11-20
逍遥友 2020-11-20
weiiron 2020-11-16