How do I remove HTML tags from a Python string?

Time: 2014-10-24 21:42:19

Tags: python regex

I have a string like this:

<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>

What is the best way to remove everything between "<" and ">", leaving only "This house believes that society benefits when we share personal information online."?

3 answers:

Answer 0 (score: 0)

Here is one way (not sure whether it is the "best"):

>>> from xml.etree.ElementTree import XML
>>> s = '<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>'
>>> x = XML(s)
>>> x.text
'This house believes that society benefits when we share personal information online.'
>>>
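
Note that `x.text` only returns the text that precedes the first child element, so it works here because the `<h2>` has no nested tags. A minimal sketch of handling nested markup with `itertext()` follows; the `nested` string is a made-up variation of the original input, and `XML()` still requires the input to be well-formed XML, which real-world HTML often is not.

```python
from xml.etree.ElementTree import XML

# Hypothetical input with a nested <em> tag, for illustration only.
nested = '<h2 class="debateHeaderProp">This house believes that <em>society</em> benefits.</h2>'
root = XML(nested)

# .text stops at the first child element.
first_chunk = root.text

# itertext() yields the text of the element and all its descendants in document order.
full_text = "".join(root.itertext())

print(first_chunk)
print(full_text)
```

If the input may contain unclosed tags such as `<br>`, `XML()` raises a `ParseError`, and an HTML-aware parser is probably the safer choice.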

Answer 1 (score: 0)

  

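XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to and from files) are usually done at the ElementTree level. Interactions with a single XML element and its sub-elements are done at the Element level.

Read more in parsing XML.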

You can also use a regular expression (here `s` is the string defined in the previous answer):

>>> import re
>>> re.search(r'(?<=>).*(?=<)' ,s).group(0)
'This house believes that society benefits when we share personal information online.'
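
The lookaround pattern above fits a single tag pair on one line; because `.*` is greedy, a line with several tags would return everything between the first `>` and the last `<`, inner tags included. A common alternative, sketched here under the assumption that the markup is simple (no `>` inside attribute values, no comments or `<script>` blocks), is to delete the tags with `re.sub`. The names `TAG_RE` and `strip_tags` are made up for this example, and `html.unescape` needs Python 3.4 or later.

```python
import re
from html import unescape

s = '<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>'

# Matches "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(markup):
    """Remove anything that looks like a tag, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub("", markup))

print(strip_tags(s))
print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))
```

This is a quick fix rather than a real HTML parser, so it is best kept for input whose shape you control.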

Answer 2 (score: 0)

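For a single line of markup, a dedicated parser is a bit of overkill. For larger data sets, however, a parser such as BeautifulSoup is the way to go. See the example below.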

from bs4 import BeautifulSoup as bsoup
import re

markup = """
<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>
<span class="debateFormat">Oregon-Oxford, Cross Examination</span>
<div class="debateAffirmSide">On the affirmative: Foo Debate Club</div>
<div class="debateOpposeSide">On the opposition: Bar Debate Club</div>
"""
# Name the parser explicitly; "html.parser" ships with Python.
soup = bsoup(markup, "html.parser")

# Explicitly define the tag and class.
motion = soup.find("h2", class_="debateHeaderProp").get_text()
# Or just use the class.
d_format = soup.find(class_="debateFormat").get_text()
# And even use regex for more power.
teams = [t.get_text() for t in soup.find_all("div", class_=re.compile(r".*debate.*Side.*"))]

# print() calls with several arguments require Python 3.
print("Our Debate for Today")
print("Motion:", motion)
print("Format:", d_format)
print(teams[0])
print(teams[1])

# Prints the following:
# Our Debate for Today
# Motion: This house believes that society benefits when we share personal information online.
# Format: Oregon-Oxford, Cross Examination
# On the affirmative: Foo Debate Club
# On the opposition: Bar Debate Club

Another option is to use an XML parser such as lxml.
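
If installing a third-party parser is not an option, the standard library's `html.parser` module can collect the text nodes too. The following is only a sketch: `TextExtractor` and `strip_html` are hypothetical names, not part of any library, and the contents of `<script>` or `<style>` elements would be included in the result as well.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()

s = '<h2 class="debateHeaderProp">This house believes that society benefits when we share personal information online.</h2>'
print(strip_html(s))
```

Unlike the XML approach, this tolerates unclosed tags, which makes it a reasonable fallback for messy HTML.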