Question

I am using the following code to fetch results from an RSS feed:

try:
    desc = item.xpath('description')[0].text
    if date is not None:
        desc = date + "\n" + "\n" + desc
except:
    desc = None

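But sometimes the description in the feed contains a few HTML-encoded Unicode characters, like this:

The text in the XML looks like “ and ' and other &amp; ...; stuff

I don't want these to show up when the content is displayed. Is there a regular expression to remove the HTML markup?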

Answer 1

I use something called "Unescaping XML"; I'm not sure whether it will be useful to you.

>>> from xml.sax.saxutils import unescape
>>> unescape("&lt; &amp; &gt;")
'< & >'
>>> unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
'\' "'
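
As the second call above shows, unescape only decodes &lt;, &gt; and &amp; unless you pass the extra entities yourself. If you are on Python 3.4 or later (an assumption, the question does not say), html.unescape from the standard library decodes named and numeric HTML entities in one go. The sketch below combines it with a deliberately naive regex for dropping tags; the sample string and the clean_description helper are made up for illustration and are not from the original feed.

```python
import html
import re
from xml.sax.saxutils import unescape as xml_unescape

# Naive pattern for anything shaped like a tag. Good enough for simple
# feed text, but it is not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def clean_description(text):
    """Hypothetical helper: drop tags, then decode HTML entities."""
    if text is None:
        return None
    # Strip tags first, so that text such as "1 &lt; 2" is not mistaken
    # for markup once the entities have been decoded.
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)


# Made-up sample input for illustration only.
raw = "&ldquo;Hello&rdquo; &amp; it&#39;s <b>here</b>"

# saxutils.unescape leaves &ldquo;, &rdquo; and &#39; untouched.
xml_only = xml_unescape(raw)

# html.unescape handles both named and numeric entities.
plain = clean_description(raw)
print(plain)
```

In the question's code this would be applied to desc after it is read from the item, for example desc = clean_description(desc).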

Edit

Just saw this, it might be interesting (untested): unescape with urllib