I have data in XML format; a sample is shown below. I want to extract data from the <text> tag.
Here is my XML data:
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
I only need: The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex.
Is this possible? Thanks.
Answer 0 (score: 4)
Use an XML parser to parse XML. With lxml:
import lxml.etree as ET
content='''\
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
'''
text=ET.fromstring(content)
print(text.text)
Output:
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
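The reason `.text` returns only the opening paragraph is that an element's `.text` holds just the character data before its first child; the text that follows each `<h1>` is stored on that child's `.tail`. Here is a minimal sketch of this, using the standard library's `xml.etree.ElementTree` instead of lxml (an assumption made only to keep the example dependency-free; lxml exposes the same two attributes). The names `intro` and `sections` are made up for illustration.

```python
import xml.etree.ElementTree as ET

content = """<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>"""

root = ET.fromstring(content)

# Text before the first child element: the paragraph the question asks for.
intro = root.text.strip()

# Text that follows each <h1> lives on that child's .tail, not on the parent.
sections = {h1.text: (h1.tail or "").strip() for h1 in root.findall("h1")}

print(intro)
print(sections["The plot"])
```

So if the text under a particular heading is ever needed, it can be read from the heading's `.tail` rather than from the parent element.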
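Answer 1 (score: 2)
Don't use regular expressions to parse XML/HTML. Use a proper parser in Python, such as BeautifulSoup or lxml.
Answer 2 (score: 2)
Whenever you find yourself looking at XML data and thinking about regular expressions, you should stop and ask yourself why you are not considering a real XML parser. The structure of XML makes it a perfect fit for a proper parser, and makes regular expressions frustrating.
If you must use a regular expression, the following should do. Until your document changes!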
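import re
# xml_text is the XML document as a string.
# Non-greedy with DOTALL: the match spans newlines and stops at the first <h1>.
p = re.compile(r"<text>(.*?)<h1>", re.DOTALL)
p.search(xml_text).group(1)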
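Spoiler: a regular expression may be appropriate if this is a one-off problem that needs a quick and dirty solution. It may also be appropriate if you know the input data is fairly static and you cannot tolerate the overhead of a parser.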
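To make the "until your document changes" caveat concrete, here is a small sketch comparing the two approaches on a shortened, made-up document and on a hypothetical revision of it in which the root tag gains an attribute. The helper names `intro_by_regex` and `intro_by_parser` are illustrative only, and the sketch uses a non-greedy `re.DOTALL` pattern so that it matches across line breaks.

```python
import re
import xml.etree.ElementTree as ET

original = "<text>\nIntro paragraph.\n<h1>The plot</h1>\nMore text.\n</text>"
# Hypothetical later revision of the same document: the root gains an attribute.
changed = original.replace("<text>", '<text lang="en">')

# Non-greedy match across newlines, from the literal "<text>" to the first "<h1>".
pattern = re.compile(r"<text>(.*?)<h1>", re.DOTALL)

def intro_by_regex(doc):
    """Return the leading text via the regex, or None if it does not match."""
    match = pattern.search(doc)
    return match.group(1).strip() if match else None

def intro_by_parser(doc):
    """Return the leading text via an XML parser."""
    return (ET.fromstring(doc).text or "").strip()

print(intro_by_regex(original), "|", intro_by_parser(original))
print(intro_by_regex(changed), "|", intro_by_parser(changed))
```

On the revised document the regex finds nothing, because the literal `<text>` no longer appears, while the parser still returns the same paragraph.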
Answer 3 (score: 2)
Here is how to do it with ElementTree:
In [18]: import xml.etree.ElementTree as et
In [19]: t = et.parse('f.xml')
In [20]: print(t.getroot().text.strip())
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
Answer 4 (score: 1)
Here is an example using xml.etree.ElementTree:
>>> import xml.etree.ElementTree as ET
>>> data = """<text>
... The 40-Year-Old Virgin is a 2005 American buddy comedy
... film about a middle-aged man's journey to finally have sex.
...
... <h1>The plot</h1>
... Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
... <h1>Cast</h1>
...
... <h1>Soundtrack</h1>
...
... <h1>External Links</h1>
... </text>"""
>>> xml = ET.XML(data)
>>> xml.text
"\n The 40-Year-Old Virgin is a 2005 American buddy comedy\n film about a middle-aged man's journey to finally have sex.\n\n "
>>> ' '.join(xml.text.split())
"The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex."
There you go!
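One edge case worth guarding against: `.text` is `None` when the element starts directly with a child element, so calling `.strip()` on it would raise `AttributeError`. A minimal sketch of a helper that handles this and collapses whitespace follows; the function name `leading_text` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def leading_text(xml_string):
    """Return the text before the root's first child, whitespace-normalized."""
    root = ET.fromstring(xml_string)
    # .text is None when the element starts directly with a child element.
    return " ".join((root.text or "").split())

print(leading_text("<text>\n  first line\n  second line\n<h1>Cast</h1></text>"))
print(repr(leading_text("<text><h1>Cast</h1></text>")))
```

Splitting on whitespace and rejoining with single spaces avoids depending on the exact indentation of the source document.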