Python script to get every text and tag inside a specified XML tag

Time: 2017-01-08 17:14:48

Tags: python xml

I need to get every tag and value inside a specific tag.

For example:

<xml>
<new>
<post>
<text>New Text</text> 
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line> 
New Line ends ......!!!!
</specific>

Python script:

import xml.etree.ElementTree as et
# 'Xml from path' stands for the XML text read from the file
root = et.fromstring('Xml from path')
target_elements = root.findall('.//post')

If I give the tag <post>, I need the output to be:

Expected output:

<text>New Text</text>
<category>New Category</category>

For the tag <specific>:

Output:

<line> Line.... </line> 
 New Line ends ......!!!!

2 answers:

Answer 0: (score: 0)

Note: the closing </xml> tag is missing at the end of your XML fragment.

content = """\
<xml>
<new>
<post>
<text>New Text</text> 
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line> 
New Line ends ......!!!!
</specific>
</xml>"""
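
To see why the closing tag matters, here is a minimal sketch using the standard library's xml.etree.ElementTree (an assumption for illustration; the rest of this answer uses lxml). The helper name is_well_formed and the abridged fragment are hypothetical and only serve the demonstration.

```python
import xml.etree.ElementTree as ET

# Abridged version of the question's fragment; <xml> is never closed.
broken = """\
<xml>
<specific>
<line> Line.... </line>
New Line ends ......!!!!
</specific>
"""


def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(broken))             # False
print(is_well_formed(broken + "</xml>"))  # True
```

A strict XML parser rejects the unclosed fragment, so the closing tag has to be restored before findall can be used.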

With lxml there is no real difficulty:

from lxml import etree

root = etree.XML(content)

# Print the tag name and text of each direct child of every <post>
for elem in root.findall(".//post"):
    for child in elem:
        print(child.tag + ": " + child.text)

If you want to output the XML fragments as strings, just use the tostring function:

for elem in root.findall(".//post"):
    for child in elem:
        # with_tail=False omits the text that follows each child's closing tag
        print(etree.tostring(child, encoding="unicode", with_tail=False))

You will get:

<text>New Text</text>
<category>New Category</category>

To learn more, read the online tutorial: http://lxml.de/tutorial.html
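
The <specific> case from the question also contains loose text after the <line> element, so the tail text has to be kept there. The following is a rough sketch of one way to handle both tags, assuming the standard library's xml.etree.ElementTree instead of lxml; the helper name inner_xml is hypothetical and not part of either library.

```python
import xml.etree.ElementTree as ET

CONTENT = """\
<xml>
<new>
<post>
<text>New Text</text>
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line>
New Line ends ......!!!!
</specific>
</xml>"""


def inner_xml(element):
    """Return the text and child markup inside element, without its own tags."""
    # Text that appears before the first child, if any
    parts = [element.text or ""]
    for child in element:
        # ET.tostring serializes the child together with its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


root = ET.fromstring(CONTENT)
for tag in ("post", "specific"):
    for elem in root.findall(".//" + tag):
        print(inner_xml(elem))
```

Because the tail is kept, the line "New Line ends ......!!!!" is printed along with the <line> element, which is what the question asks for in the <specific> case.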

Answer 1: (score: 0)

I would go with BeautifulSoup:

from bs4 import BeautifulSoup

xml_doc = '''<xml>
<new>
<post>
<text>New Text</text>
<category>New Category</category>
</post>
</new>
<specific>
<line> Line.... </line>
New Line ends ......!!!!
</specific>'''

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(xml_doc, "html.parser")
print(soup.find_all('post'))

Output:

[<post>
<text>New Text</text>
<category>New Category</category>
</post>]