Question

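I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full parser because I need speed rather than accuracy (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in re.sub. What are good ways to speed up the process? I could implement some portions in C, but I wonder whether that would help, given that the time is spent inside re.sub, which I assume is already efficiently implemented.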

import re

# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx    = re.compile(r"<script.*?/script>", re.I)
styleRx     = re.compile(r"<style.*?/style>", re.I)
tagsRx      = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx  = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx    = re.compile(r"\s{2,}")  # raw string so \s is not treated as a string escape
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....

Thanks!

Answer 1

First, use an HTML parser built for this job, such as BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

Then you can use a profiler to identify the specific slow spots that remain:

http://docs.python.org/library/profile.html
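
As a rough sketch of that profiling step, the snippet below runs the question's five substitutions under cProfile and prints the most expensive calls. The `strip_html` function and the sample input are made up for illustration; they are not the asker's actual script.

```python
import cProfile
import io
import pstats
import re

# The five patterns from the question (written as raw strings, otherwise unchanged).
PATTERNS = [
    re.compile(r"<script.*?/script>", re.I),
    re.compile(r"<style.*?/style>", re.I),
    re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>"),
    re.compile(r"&[0-9a-zA-Z]+;"),
    re.compile(r"\s{2,}"),
]

def strip_html(text):
    """Hypothetical stand-in for the asker's script: apply each pattern in turn."""
    for rx in PATTERNS:
        text = rx.sub(" ", text)
    return text

# Made-up sample input, repeated so the profile has something to measure.
sample = "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>\n" * 2000

profiler = cProfile.Profile()
profiler.enable()
result = strip_html(sample)
profiler.disable()

# Print the five entries with the largest cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

On real data, the per-pattern cost is easier to see if each substitution is wrapped in its own small function, so that each one gets a separate row in the report.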

For learning regular expressions, I have found Mastering Regular Expressions very valuable, whatever the programming language:

http://oreilly.com/catalog/9781565922570

Also:

How can I debug a regular expression in python?

Given the restated use case, I would say the above is not what you want for this request. My alternative suggestion is: Speeding up regular expressions in Python

Answer 2

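You are processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes. I would also avoid reluctant quantifiers, which trade efficiency for convenience. Consider this regex: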

<script.*?</script>

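Each time the . goes to consume another character, it first has to make sure that </script> does not match at that position. It is almost like doing a negative lookahead at every position: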

<script(?:(?!</script>).)*</script>

But we know there is no point in doing the lookahead unless the next character is <, so we can tailor the regex accordingly:

<script[^<]*(?:<(?!/script>)[^<]*)*</script>

When I test them in RegexBuddy against this target string:

<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>

...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
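
To check whether the difference also shows up in Python's re module, one could time the two patterns against the same target string. This is only a minimal sketch: RegexBuddy's step counts do not translate directly into Python timings, so treat the printed numbers as indicative and measure on your own data.

```python
import re
import timeit

target = ('<script type="text/javascript">'
          "var imagePath='http://sstatic.net/stackoverflow/img/';</script>")

reluctant = re.compile(r"<script.*?</script>")
tailored = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>")

# Both patterns should match the same span of the target.
assert reluctant.search(target).group() == tailored.search(target).group()

for name, rx in (("reluctant", reluctant), ("tailored", tailored)):
    seconds = timeit.timeit(lambda: rx.search(target), number=100_000)
    print(f"{name}: {seconds:.3f}s for 100,000 searches")
```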

Combining your first three regexes into one yields this beast:

<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)

You may want to remove the <HEAD> element as well while you are at it (i.e., (script|style|head)).

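I don't know what you are doing with the fourth regex, the one for character entities: are you just deleting those too? I am guessing the fifth regex has to be run separately, since some of the whitespace it cleans up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you whether it is worth going forward with this approach.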
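
Here is a minimal sketch of how the combined regex might be wired into the original script. The `re.I` flag and the separate whitespace pass are carried over from the question; the `strip_markup` name and the sample input are assumptions for illustration, and entities are deliberately left alone, as in the discussion above.

```python
import re

# The combined pattern from this answer: script/style blocks and ordinary tags in one pass.
combined = re.compile(
    r"<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)",
    re.I,
)
# Whitespace cleanup still runs afterwards, since the first pass creates new spaces.
spaces = re.compile(r"\s{2,}")

def strip_markup(html):
    """Hypothetical helper: two passes over the text instead of four."""
    text = combined.sub(" ", html)
    return spaces.sub(" ", text).strip()

# Made-up sample input.
sample = (
    "<html><head><style>p { color: red; }</style></head>\n"
    "<body><p>Hello</p>\n"
    "<script>if (a < b) { run(); }</script>\n"
    "<p>world</p></body></html>"
)
print(strip_markup(sample))
```

Because `[^<]` also matches newlines, this pattern handles script and style blocks that span several lines without needing the DOTALL flag.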

Answer 3

One thing you can do is combine the script/style regexes using a backreference. Here is some sample data:

$ cat sample 
<script>some stuff</script>
<html>whatever </html>
<style>some other stuff</style>

Using perl:

perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print "$1\n"; }' sample

It will match either script or style. I recommend "Mastering Regular Expressions"; it is a very good book.
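
Since the question is about Python, the same backreference idea might look like this with the re module. The `re.S` flag is my own addition so that a block can span several lines; the perl one-liner above works line by line and does not need it.

```python
import re

# Same idea as the perl one-liner: \1 must repeat whatever group 1 captured.
block = re.compile(r"<(script|style)>.*?</\1>", re.S)

sample = (
    "<script>some stuff</script>\n"
    "<html>whatever </html>\n"
    "<style>some other stuff</style>\n"
)

# Collect the tag name of every block that matched.
tags = [m.group(1) for m in block.finditer(sample)]
print(tags)
```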

Answer 4

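The suggestion to use an HTML parser is a good one, since it will quite possibly be faster than regular expressions. But I am not sure BeautifulSoup is the right tool for the job, because it constructs a parse tree from the entire file and keeps the whole thing in memory. For a terabyte of HTML, you would need an obscene amount of RAM to do that ;-) I suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it is a stream parser, so it will only load a bit of the text at a time.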
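
A minimal sketch of that streaming approach, assuming Python 3, where HTMLParser lives in the html.parser module. The `TextExtractor` class, the `extract_text` helper, and the chunk size are illustrative choices of mine, and I have not measured how its speed compares with the regexes.

```python
import io
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        self.parts.append(" ")  # a tag becomes a space, as in the regex version

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def extract_text(stream, chunk_size=64 * 1024):
    """Feed the parser one chunk at a time so the whole file is never in memory."""
    parser = TextExtractor()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return " ".join("".join(parser.parts).split())

# Made-up sample input; a real run would pass an open file instead.
sample_html = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello &amp; welcome</p>"
    "<script>var a = '<b>';</script><p>world</p></body></html>"
)
print(extract_text(io.StringIO(sample_html), chunk_size=16))
```

For a very large file, the collected parts could be written out as they arrive instead of being kept in a list.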

Answer 5

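If your use case is indeed to parse out a few things from each of millions of documents, then my answer above won't help. I recommend some heuristics, such as starting with a couple of "straight text" regexes, like plain /script/ and /style/, so you can throw things out quickly where possible. In fact, do you really need to check for the end tag at all? Isn't <style good enough? Leave validation to someone else. If the quick checks succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../, so that the text does not have to be scanned once for every regex.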
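
One way to read that suggestion in Python is sketched below: a cheap probe decides whether the script/style pass is needed at all, and everything else goes through a single alternation. All names here are hypothetical, the end-tag check is kept, and whether the probe actually saves time depends on how many documents contain such blocks, so it needs measuring.

```python
import re

# Cheap "straight text" probe: is there any script/style block at all?
has_block = re.compile(r"<s(?:cript|tyle)", re.I)

# Blocks are removed only when the probe fires.
blocks = re.compile(r"<(script|style).*?</\1>", re.I | re.S)

# Everything else in a single pass, in the spirit of /<script|<style|\s{2,}|etc.../.
rest = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>|&[0-9a-zA-Z]+;|\s{2,}")

def clean(text):
    if has_block.search(text):
        text = blocks.sub(" ", text)
    return rest.sub(" ", text)

# Made-up sample inputs.
plain = "<p>Hello&nbsp;world</p>"
scripted = "<p>Hi</p><script>var a = 1;</script><p>there</p>"
print(repr(clean(plain)))
print(repr(clean(scripted)))
```

Note that a single pass cannot collapse the spaces it has just produced, so the output may still contain runs of spaces; a final whitespace pass would be needed if that matters.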

Answer 6

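I would use a simple program built on plain Python partition, something like this, though it has only been tested with one style example file: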

## simple filtering when not hierarchical tags inside other discarded tags

start_tags = ('<style', '<script')
end_tags = ('</style>', '</script>')

##print("input:\n %s" % open('giant.html').read())
end_tag = ''  # closing tag we are still waiting for; '' when outside a discarded block

with open('giant.html') as src, open('cleaned.html', 'w') as out:
    for line in src:
        line = ' '.join(line.split())  # collapse runs of whitespace
        if end_tag:
            if end_tag not in line:
                continue  ## discard rest of line if no end tag found in line
            # keep only what follows the closing tag, then scan it like any other line
            _, _, line = line.partition(end_tag)
            end_tag = ''  # end tag reset after found

        # cut out every discarded block that starts on this line
        for start_tag, closing in zip(start_tags, end_tags):
            while start_tag in line:
                kept, _, rest = line.partition(start_tag)
                if closing in rest:
                    # closing tag already in same line: drop the block, keep both sides
                    _, _, rest = rest.partition(closing)
                    line = kept + ' ' + rest
                else:
                    # no end tag at same line: keep the text before the block only
                    line = kept
                    end_tag = closing

        line = line.strip()
        if line:
            out.write(line + '\n')

##print('result:\n%s' % open('cleaned.html').read())

Speeding up regular expressions in Python