Question

I have a string like this:

<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>

What is the best way to remove everything between "<" and ">", leaving only "This house believes that society benefits when we share personal information online."?

Answer 1

Here is one way (not sure whether it is the "best"):

>>> from xml.etree.ElementTree import XML
>>> s = '<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>'
>>> x = XML(s)
>>> x.text
'This house believes that society benefits when we share personal information online.'
>>>
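Note that `x.text` returns only the text that precedes the first child element. The following is a minimal sketch, assuming the markup is well-formed XML and may contain nested tags; the helper name `element_text` and the `nested` sample string are made up for illustration.

```python
from xml.etree.ElementTree import XML


def element_text(markup):
    """Return all text inside a well-formed XML fragment, including nested elements."""
    root = XML(markup)
    # itertext() yields the element's text, its children's text, and their tails
    # in document order.
    return "".join(root.itertext())


# Hypothetical sample with a nested <em> tag.
nested = '<h2 class="debateHeaderProp">This house <em>believes</em> that society benefits.</h2>'
print(element_text(nested))
```

`XML()` raises `ParseError` on malformed input, so this only suits markup that is known to be valid XML.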

Answer 2

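XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level. Interactions with a single XML element and its sub-elements are done at the Element level.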

Read more in Parsing XML.

You can also use a regular expression:

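>>> import re
>>> s = '<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>'
>>> re.search(r'(?<=>).*(?=<)', s).group(0)
'This house believes that society benefits when we share personal information online.'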
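The pattern above captures everything between the first ">" and the last "<", so it would keep inner tags if the string contained more than one element. As a rough sketch, assuming simple markup without comments, scripts, or ">" characters inside attribute values, `re.sub` can delete every tag instead; `TAG_RE` and `strip_tags` are illustrative names.

```python
import re

# Matches "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove anything that looks like a tag; not a full HTML parser."""
    return TAG_RE.sub("", text)


s = ('<h2 class="debateHeaderProp">This house believes that society benefits '
     'when we share personal information online.</h2>')
print(strip_tags(s))
```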

Answer 3

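For a single line of markup, using a dedicated parser is a bit of overkill. For larger data sets, however, a parser such as BeautifulSoup is the way to go. See the example below.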

from bs4 import BeautifulSoup as bsoup
import re

markup = """
<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>
<span class="debateFormat">Oregon-Oxford, Cross Examination</span>
<div class="debateAffirmSide">On the affirmative: Foo Debate Club</div>
<div class="debateOpposeSide">On the opposition: Bar Debate Club</div>
"""
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = bsoup(markup, "html.parser")

# Explicitly define the tag and class.
motion = soup.find("h2", class_="debateHeaderProp").get_text()
# Or just use the class.
d_format = soup.find(class_="debateFormat").get_text()
# And even use regex for more power.
teams = [t.get_text() for t in soup.find_all("div", class_=re.compile(r".*debate.*Side.*"))]

print("Our Debate for Today")
print("Motion:", motion)
print("Format:", d_format)
print(teams[0])
print(teams[1])

# Prints the following:
# Our Debate for Today
# Motion: This house believes that society benefits when we share personal information online.
# Format: Oregon-Oxford, Cross Examination
# On the affirmative: Foo Debate Club
# On the opposition: Bar Debate Club

Another option is to use an XML parser such as lxml.
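When installing a third-party parser is not an option, the standard library's `html.parser` module can extract text in a similar way. This is a minimal sketch rather than a drop-in replacement for BeautifulSoup or lxml; the class name `TextExtractor` and the function `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags.
        self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.chunks)


print(html_to_text('<div class="debateAffirmSide">On the affirmative: Foo Debate Club</div>'))
```

Unlike the ElementTree approach, `HTMLParser` tolerates markup that is not well-formed XML.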

How to remove HTML tags from a Python string?
