Question

Possible duplicate:
Strip html from strings in python

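While building a small browser-like application, I ran into the problem of separating the text from the different tags. Consider the string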

<html> <h1> good morning </h1> welcome </html>

I need the following output: ['good morning', 'welcome']

How can I do this in Python?

Answer 1

I would use xml.etree.ElementTree:

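import xml.etree.ElementTree as ET

def get_text(etree):
    # Visit each direct child; yield its own text, then the text that follows it.
    for child in etree:
        if child.text and child.text.strip():
            yield child.text.strip()
        if child.tail and child.tail.strip():
            yield child.tail.strip()

root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
# Stripping whitespace gives ['good morning', 'welcome'], as asked in the question.
print(list(get_text(root)))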
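
Note that the generator above only looks at the direct children of the root. If the markup is nested more deeply, a minimal sketch along the same lines could use ElementTree's itertext(), which walks every descendant in document order. This assumes the input is well-formed XML, and the helper name get_all_text is made up for this example:

```python
import xml.etree.ElementTree as ET

def get_all_text(markup):
    """Return the non-empty, stripped text chunks of a well-formed XML string."""
    root = ET.fromstring(markup)
    # itertext() yields the text and tail of every element, in document order.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]

print(get_all_text('<html> <h1> good morning </h1> welcome </html>'))
```

ET.fromstring raises a ParseError on markup that is not well-formed (unclosed tags, for instance), so real-world HTML may need a more forgiving parser.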

Answer 2

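You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular. lxml is popular too.

Both of the above are third-party packages. You can also use the standard library: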

http://docs.python.org/library/xml.etree.elementtree.html
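
As a rough sketch of the standard-library route, assuming Python 3: besides ElementTree, the html.parser module provides HTMLParser, which does not require well-formed markup. The class name TextCollector and the function extract_text are invented for this example and are not part of any library:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks

print(extract_text('<html> <h1> good morning </h1> welcome </html>'))
```

This collects text from every tag, including the contents of script and style elements, so those would need extra handling on a real page.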

Answer 3

I would use the Python library Beautiful Soup for this. With its help it only takes a few lines:

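from bs4 import BeautifulSoup

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
# stripped_strings yields each text node with surrounding whitespace removed.
print(list(soup.stripped_strings))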
