Question

I need to get every tag and value inside a specific tag.

For example:

<xml>
<new>
<post>
<text>New Text</text> 
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line> 
New Line ends ......!!!!
</specific>

Python script:

root = et.fromstring('Xml from path')
target_elements = root.findall('.//post')

If I give the tag <post>, I need the output to be:

Expected output:

<text>New Text</text>
<category>New Category</category>

For the tag <specific>:

Output:

<line> Line.... </line> 
 New Line ends ......!!!!

Answer 1

Note: the XML fragment is missing the closing </xml> tag at the end.

content = """\
<xml>
<new>
<post>
<text>New Text</text> 
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line> 
New Line ends ......!!!!
</specific>
</xml>"""

With lxml this is straightforward:

from lxml import etree

root = etree.XML(content)

for elem in root.findall(".//post"):
    for child in iter(elem):
        print(child.tag + ": " + child.text)

If you want to output the XML fragments as strings, just use the tostring function:

for elem in root.findall(".//post"):
    for child in iter(elem):
        print(etree.tostring(child, encoding="unicode", with_tail=False))

You get:

<text>New Text</text>
<category>New Category</category>
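
The <specific> case from the question is slightly different, because the text after </line> is not inside any child element; it is stored as the tail of <line>. A minimal sketch of one way to handle both cases follows. It uses the standard library's xml.etree.ElementTree instead of lxml, and inner_xml is a hypothetical helper name, not part of either library. It relies on ElementTree's tostring() serializing an element together with its tail text.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, with the closing </xml> tag added.
content = """\
<xml>
<new>
<post>
<text>New Text</text>
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line>
New Line ends ......!!!!
</specific>
</xml>"""


def inner_xml(elem):
    """Return everything between elem's opening and closing tags as a string.

    Hypothetical helper for illustration only.
    """
    # Text that appears before the first child element, if any.
    parts = [elem.text or ""]
    for child in elem:
        # ElementTree's tostring() also emits the child's tail text,
        # i.e. the text between this child's end tag and the next tag.
        parts.append(ET.tostring(child, encoding="unicode"))
    # Drop the leading and trailing newlines that come from the layout.
    return "".join(parts).strip()


root = ET.fromstring(content)
for tag in ("post", "specific"):
    for elem in root.findall(".//" + tag):
        print(inner_xml(elem))
```

With lxml the same idea should work, but there tostring() takes a with_tail argument, so the with_tail=False used above would have to be dropped to keep the trailing text.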

To learn more, read the online tutorial: http://lxml.de/tutorial.html

Answer 2

I would go with BeautifulSoup:

from bs4 import BeautifulSoup

xml_doc = '''<xml>
<new>
<post>
<text>New Text</text>
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line>
New Line ends ......!!!!
</specific>'''

# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(xml_doc, "html.parser")
print(soup.find_all('post'))

Output:

[<post>
<text>New Text</text>
<category>New Category</category>
</post>]

Python script to get every text and tag inside a specified XML tag
