Question

I have data in XML format. A sample is shown below. I want to extract the data from the <text> tag. Here is my XML data.

    <text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
    </text>

I only need "The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex." Is this possible? Thanks.

Answer 1

Parse the XML with an XML parser. Using lxml:

import lxml.etree as ET

content='''\
<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>
'''

text=ET.fromstring(content)
print(text.text)

which yields

    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.
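
The reason `text.text` returns only the opening paragraph is that, in the ElementTree data model, an element's `.text` holds the text before its first child, while text that follows a child is stored on that child's `.tail`. Here is a minimal sketch of that behavior. It assumes the standard library's `xml.etree.ElementTree` in place of lxml (both expose `.text` and `.tail` the same way for this case), and the sample is shortened.

```python
import xml.etree.ElementTree as ET

content = """<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>
</text>"""

root = ET.fromstring(content)

# Text before the first child element lives on the parent's .text.
intro = root.text.strip()

# Text that follows a child element lives on that child's .tail.
first_h1 = root.find("h1")
after_plot = first_h1.tail.strip()

print(intro)
print(first_h1.text, "->", after_plot)
```

So if the text under "The plot" were needed later, it would come from the first `<h1>` element's `.tail`, not from `root.text`.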

Answer 2

Don't use regular expressions to parse XML/HTML. Use a proper parser in Python, such as BeautifulSoup or lxml.

Answer 3

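Whenever you find yourself looking at XML data and thinking about regular expressions, you should stop and ask yourself why you aren't considering a real XML parser. The structure of XML makes it a natural fit for a proper parser and makes regular expressions frustrating.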

If you must use a regular expression, the following should do the job, at least until your document changes!

import re
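# xml_text holds the XML document as a string.
# re.DOTALL lets "." match newlines; ".*?" stops at the first <h1>.
p = re.compile(r"<text>(.*?)<h1>", re.DOTALL)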
p.search(xml_text).group(1)

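Spoiler: regular expressions may be appropriate if this is a one-off problem that needs a quick-and-dirty solution. They may also be appropriate if you know the input data is fairly static and you can't tolerate the overhead of a parser.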
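
To make "until your document changes" concrete, here is a small sketch comparing a regex with a parser. It uses a variant of the pattern above with `re.DOTALL` and a non-greedy group so that it can match across lines. The shortened documents, the `lang` attribute, and the helper names `intro_via_regex` and `intro_via_parser` are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from the literal "<text>" tag to the first "<h1>".
pattern = re.compile(r"<text>(.*?)<h1>", re.DOTALL)

original = "<text>\n  Intro paragraph.\n  <h1>The plot</h1>\n</text>"
# Same content, but the root tag gained a (hypothetical) attribute.
changed = '<text lang="en">\n  Intro paragraph.\n  <h1>The plot</h1>\n</text>'

def intro_via_regex(doc):
    match = pattern.search(doc)
    return match.group(1).strip() if match else None

def intro_via_parser(doc):
    return ET.fromstring(doc).text.strip()

print(intro_via_regex(original))
print(intro_via_regex(changed))
print(intro_via_parser(changed))
```

The regex finds the paragraph in `original` but returns `None` for `changed`, because the literal `<text>` no longer appears. The parser returns the same paragraph for both.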

Answer 4

Here's a way to do this using ElementTree:

In [18]: import xml.etree.ElementTree as et

In [19]: t = et.parse('f.xml')

In [20]: print(t.getroot().text.strip())
The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

Answer 5

Here's an example using xml.etree.ElementTree:

>>> import xml.etree.ElementTree as ET
>>> data = """<text>
...     The 40-Year-Old Virgin is a 2005 American buddy comedy
...     film about a middle-aged man's journey to finally have sex.
... 
...     <h1>The plot</h1>
...     Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
...     <h1>Cast</h1>
... 
...     <h1>Soundtrack</h1>
... 
...     <h1>External Links</h1>
... </text>"""
>>> xml = ET.XML(data)
>>> xml.text
"\n    The 40-Year-Old Virgin is a 2005 American buddy comedy\n    film about a middle-aged man's journey to finally have sex.\n\n    "
>>> xml.text.strip().replace('\n   ', '')
"The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex."

There you go!
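
One caveat: `replace('\n   ', '')` depends on the exact indentation of the sample (a newline followed by four spaces). If the indentation might vary, a possible alternative is to collapse every run of whitespace with `split()` and `join()`. The helper name `leading_text` and the tab-indented sample below are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def leading_text(xml_string):
    """Return the text before the root's first child, with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    # root.text is None when the root starts directly with a child element.
    return " ".join((root.text or "").split())

# Same sample, but indented with tabs instead of four spaces.
data = (
    "<text>\n"
    "\tThe 40-Year-Old Virgin is a 2005 American buddy comedy\n"
    "\tfilm about a middle-aged man's journey to finally have sex.\n"
    "\n"
    "\t<h1>The plot</h1>\n"
    "</text>"
)

print(leading_text(data))
```

Because `str.split()` with no arguments splits on any whitespace, the result is a single line regardless of whether the source used spaces, tabs, or blank lines.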

On regular expressions and XML
