Question

I am using Universal Feed Parser to parse RSS content. The description tag sometimes contains values like the following:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>

To remove the HTML elements/tags, I am using the following regular expression.

pattern = re.compile(u'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This removes the HTML tags, but not the XML comments. How can I remove both the elements and the XML comments?

Answer 1

Use lxml:

import lxml.html as LH

content='''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

doc=LH.fromstring(content)
print(doc.text_content())

This yields:

This is a Test Paragraph
Sample Bold
Sampe Text

Answer 2

Using regular expressions for this is a bad idea.

I would use a real parser, then navigate the DOM tree and remove whatever I want.
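A minimal sketch of the parser-based approach, assuming only the standard library is available. Note that it swaps the DOM walk described above for the event-based `html.parser.HTMLParser`: text nodes are collected in `handle_data`, while tags and comments are simply not collected. The names `TextExtractor` and `strip_markup` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and comments are never stored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments go to handle_comment,
        # which is left as the default no-op.
        self.chunks.append(data)


def strip_markup(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = (
    "<!--This is the XML comment -->\n"
    "<p>This is a Test Paragraph</p></br>\n"
    "<b>Sample Bold</b>\n"
    "<m:Table>Sampe Text</m:Table>"
)
# Collapse the leftover newlines into single spaces before printing.
print(" ".join(strip_markup(sample).split()))
```

Because the parser tokenizes the markup itself, a `>` inside a comment or an attribute value does not confuse it the way it can confuse a simple pattern.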

Answer 3

There is a simple way to do this in pure Python:

def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = False  # True while inside a quoted attribute value
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

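The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it in action here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you are interested in the course (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!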
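One thing to watch for with a single tag/quote state machine: an apostrophe inside a comment (for example `<!--This isn't a tag -->`) is taken as the start of a quoted value, so the scanner may never leave the tag. The variant below is only a sketch of one possible way around that, not part of the course material: it skips `<!-- ... -->` blocks explicitly and remembers which quote character opened an attribute value. The function name is made up for this example.

```python
def remove_markup_and_comments(s):
    out = []
    i = 0
    n = len(s)
    while i < n:
        if s.startswith("<!--", i):
            # Skip the whole comment; an unterminated comment eats the rest.
            end = s.find("-->", i + 4)
            i = n if end == -1 else end + 3
        elif s[i] == "<":
            # Skip a tag, honouring quoted attribute values.
            i += 1
            quote = None  # the quote character that opened the value, if any
            while i < n:
                c = s[i]
                i += 1
                if quote:
                    if c == quote:
                        quote = None
                elif c in "\"'":
                    quote = c
                elif c == ">":
                    break
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(remove_markup_and_comments("<!--This isn't a tag --><b>Sample Bold</b>"))
print(remove_markup_and_comments('<a title="x > y">link</a>'))
```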

Answer 4

Why so complicated?

re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)
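As a quick illustration of how that one-liner behaves, here is a small self-contained sketch. The wrapper name `strip_keep_cdata` and the sample strings are made up for this example. The first alternative unwraps CDATA sections and keeps their content through `group(1)`; the second removes anything from `<` to the next `>`, which also covers simple comments.

```python
import re

PATTERN = re.compile(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', re.DOTALL)


def strip_keep_cdata(desc):
    # group(1) is the CDATA payload, or None when a tag/comment matched.
    return PATTERN.sub(lambda m: m.group(1) or '', desc)


print(strip_keep_cdata('<!--c--><p>Hi <![CDATA[a < b]]></p>'))

# Caveat: a '>' inside a comment ends the lazy match early.
print(repr(strip_keep_cdata('<!-- a > b -->x')))
```

The second call leaves ` b -->x` behind, so this pattern is only safe when comments are known not to contain `>`.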

If you want the XML tags to stay intact, you should check the list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use the regex r'(<!\[CDATA\[.*?\]\]>)|<!--.*?-->|</?(?:tag names separated by pipes)(?:\s.*?)?>' instead.
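A rough sketch of that tag-list variant follows. Two things are assumptions of this example rather than part of the answer: the short `HTML_TAGS` list (a hypothetical subset, where the real list would come from the HTML specification) and the explicit `<!--.*?-->` branch used to drop comments. CDATA sections are captured whole in group 1 and put back unchanged.

```python
import re

# Hypothetical subset of HTML tag names, for illustration only.
HTML_TAGS = ["p", "b", "br", "i", "div", "span", "a"]

PATTERN = re.compile(
    r'(<!\[CDATA\[.*?\]\]>)'            # group 1: CDATA, kept as-is
    r'|<!--.*?-->'                      # comments, removed
    r'|</?(?:' + "|".join(HTML_TAGS) + r')(?:\s.*?)?>',  # listed HTML tags
    re.DOTALL | re.IGNORECASE,
)


def strip_html_only(desc):
    # Only CDATA matches have group 1 set; everything else becomes ''.
    return PATTERN.sub(lambda m: m.group(1) or '', desc)


print(strip_html_only('<!--c--><p>Keep <m:Table>x</m:Table></p>'))
```

Tags that are not in the list, such as `<m:Table>`, pass through untouched. Self-closing forms without a space (for example `<br/>`) are not matched by this pattern and would need an extra branch.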

Regular expression in Python to remove XML comments and HTML elements

4 answers