Question

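I have to parse HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4 (BeautifulSoup) module, since most people recommend it. Here is the code snippet that extracts the text: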

from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
[x.extract() for x in soup.findAll(['script', 'style'])]
txt = soup.get_text().encode('utf8')

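Without bs4, one file (my test case) is processed in 9 minutes, but when I use bs4 to parse the text, the job takes about 4 hours. What is going on here? Is there a better solution than bs4? Note: bs4 is the package that provides classes such as BeautifulSoup.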

Answer 1

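The main time-consuming part here is extracting the tags in the list comprehension. With Python regular expressions or with lxml, you can do the following instead.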

import re

# re.DOTALL lets ".*?" span newlines, so multi-line scripts are matched too
script_pat = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)

# to find all script tags
script_pat.findall(src)

# do your stuff
print(script_pat.sub('', src))
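
The question's code drops both script and style tags, so the same regex idea can be extended to cover both. This is only a rough sketch: the helper name `strip_script_and_style` and the sample string are made up for illustration, and a regex will not cope with every malformed page found in a crawl.

```python
import re

# Matches <script>...</script> and <style>...</style> blocks.
# re.DOTALL lets ".*?" cross newlines; re.IGNORECASE also catches <SCRIPT>.
BLOCK_PAT = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)

def strip_script_and_style(src):
    """Return src with every script and style block removed (hypothetical helper)."""
    return BLOCK_PAT.sub('', src)

# Made-up sample document, not taken from the crawl data
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><SCRIPT>\nvar x = 1;\n</SCRIPT></body></html>"
)
cleaned = strip_script_and_style(sample)
print(cleaned)
```

The remaining markup still has to be turned into text afterwards, so this only replaces the tag-removal step.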

With lxml you can do this:

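from lxml import html
et = html.fromstring(src)

# remove the <script> elements together with their contents
# (drop_tag() would remove only the tag and keep the script text)
for x in et.xpath('//script'):
    x.drop_tree()

# do your stuff
print(html.tostring(et, encoding='unicode'))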
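
Since the question is after the text rather than the cleaned markup, here is a minimal sketch of an lxml-only replacement for the `soup.get_text()` step. It assumes lxml is installed; `extract_text` and the sample string are hypothetical names for illustration, and I have not benchmarked it against the Common Crawl files.

```python
from lxml import html

def extract_text(src):
    """Parse src with lxml, drop script/style elements, return the visible text."""
    tree = html.fromstring(src)
    # drop_tree() removes each element and its contents but keeps any tail text
    for node in tree.xpath('//script | //style'):
        node.drop_tree()
    # text_content() concatenates the text of the element and all descendants
    return tree.text_content()

# Made-up sample document
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
text = extract_text(sample)
print(text)
```

If bytes are needed, as in the question, the result can be encoded afterwards with `text.encode('utf8')`.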

Beautiful Soup takes too much time for text extraction on Common Crawl data
