Question

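I have this list of XML files, and I need to filter some tags out of them. The problem is the text: it contains a lot of HTML markup and URLs, and I need plain text. I want to remove these elements in a loop and then append the cleaned text to a new list. This is what I have so far.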

    data = []
    for conv in root.findall('./conversations/conversation'):
        pattern = re.compile( r'!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\\\\\\+&amp;%\$#\=~_\-]+))*\b!i')
        if pattern.search(conv.text):
           re.sub(pattern, ' ')
           data.append(conv.text)

I can't work out the right regular expression to remove fragments like br />;<br /> as well as URLs like this one: http://neocash43.blog.com/2011/07/26/psp-sport-assessment-neopets-the-wand-of-wishing/</a>

The second problem is that, with this XML root structure, I don't know how to append the cleaned conversation text to my new list.

Answer 1

The pattern.web Python module has an HTML-to-text function called plaintext. By default, this function removes all HTML tags. For the URLs, use your existing regex.
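If installing pattern.web is not an option, the same two-step idea (strip the tags, then strip the URLs) can be sketched with the standard library alone. This is not pattern.web's plaintext; it swaps in html.parser for the tag removal. The names clean_text, URL_RE and SAMPLE_XML are made up for this example, the URL pattern is a deliberately simple stand-in rather than the regex from the question, and the sample XML only guesses at the real file layout from the findall path in the question.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Simple URL pattern (an assumption, not the asker's regex): a scheme or
# "www." followed by everything up to whitespace, a quote or an angle bracket.
URL_RE = re.compile(r'(?:(?:https?|ftp)://|www\.)[^\s<>"]+', re.IGNORECASE)


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw):
    """Strip HTML tags, then URLs, then collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flushes any text still buffered after the last tag
    # Join with a space so that "a<br />b" becomes "a b", not "ab".
    text = ' '.join(parser.parts)
    text = URL_RE.sub(' ', text)
    return ' '.join(text.split())


# Hypothetical sample shaped like the path used in the question.
SAMPLE_XML = """<root><conversations>
  <conversation>Hello&lt;br /&gt;see &lt;a href="http://example.com/x"&gt;http://example.com/x&lt;/a&gt; now</conversation>
  <conversation>plain text</conversation>
  <conversation/>
</conversations></root>"""

root = ET.fromstring(SAMPLE_XML)
data = []
for conv in root.findall('./conversations/conversation'):
    if conv.text:  # skip empty elements, whose .text is None
        data.append(clean_text(conv.text))

print(data)  # ['Hello see now', 'plain text']
```

Cleaning every conversation and appending the result inside the loop also covers the second question. Note that a malformed fragment such as a tag missing its opening < is not recognised as a tag and would pass through as ordinary text.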

Answer 2

You could try the htmlStripper.py script from the pyparsing library: http://pyparsing.wikispaces.com/file/view/htmlStripper.py/591745692/htmlStripper.py. I just used this script on my machine with Python 3.4.

How to remove HTML and URLs using Python
