We have the following XML file:
<?xml version="1.0" encoding="utf-8"?>
<doc id="ENG_DF_000170_20150219_F0010008Z">
<post id="p10" author="Kosh" datetime="2015-02-19T21:33:00">
<quote orig_author="Luddly Neddite">
<quote orig_author="zeke">
<quote orig_author="Luddly Neddite">
<quote orig_author="occupied">
Don't forget the fucking Moonies.
</quote>
The Bushes have middle east oil money behind them. They are owned by such as the bin Laden's and Saudi Prince Alwaleed bin Talal.
That's in addition to the Koch/Adelson openly buying elections.
</quote>
I think the Repubs have a brilliant strategy by running Bush 3. And Clinton 2.
It will allow the hyper partisans on both sides to make the decision as to who will be president.
Because people like me will just say fuck it to voting. If these two represent the very best that America has to offer in the form of leadership, we are royally and truly fucked.
And I am done voting. Not that my vote means much anyway.
</quote>
It's being reported that of the 21 people reportedly advising Jeb Bush, 19 are veterans of the first Bush administration, the second Bush administration, or in a few cases, both.
Some of the more notable names are Secretary of State (James Baker), his brother’s Deputy Defense Secretary (Paul Wolfowitz), his brother’s National Security Adviser (Stephen Hadley),
a variety of members from his brother’s cabinet (Tom Ridge and Michael Chertoff).
</quote>
So why does the far left care? None of you far left drones will vote for him anyway, so what difference does it make?
</post>
</doc>
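We want to find the post tag, then recursively walk the nested quote tags and print the text between each <quote> and </quote>.
We used the following Python code. We call findall('.//quote') so that the nested quote tags are retrieved recursively.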
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import xml.etree.ElementTree as ET


def search_for_query(path):
    tree = ET.parse(path)
    root = tree.getroot()
    for i in range(0, len(root)):
        # retrieve data from post
        if root[i].tag == "post":
            # recursively retrieve quote
            quotes = root[i].findall('.//quote')
            for quote in quotes:
                print(quote.get("orig_author"))
                # quote.text holds only the text before the first child element
                print(quote.text)


if __name__ == "__main__":
    queries_xml = sys.argv[1]
    search_for_query(queries_xml)
The problem is that it skips all of the text except for the innermost quote:
Luddly Neddite
zeke
Luddly Neddite
occupied
Don't forget the fucking Moonies.
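I think I am missing something about Element.findall(). Its definition says:
Element.findall() finds only elements with a tag which are direct children of the current element.
So yes, it looks like I am not descending into the children of quote.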
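As a side note, the quoted definition describes findall() called with a bare tag name. A minimal sketch, using a made-up miniature of the post (the SAMPLE string and the author names A, B, C are assumptions for illustration), suggests that the './/' prefix already reaches quotes at every depth, so the missing text is probably not caused by findall() itself:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the post above: three nested quotes.
SAMPLE = """<post>
<quote orig_author="A">
<quote orig_author="B">
<quote orig_author="C">innermost</quote>
middle
</quote>
outer
</quote>
reply
</post>"""

post = ET.fromstring(SAMPLE)

# A bare tag name matches direct children of the current element only.
direct = [q.get("orig_author") for q in post.findall("quote")]

# The './/' prefix matches descendants at any depth, in document order.
nested = [q.get("orig_author") for q in post.findall(".//quote")]

print(direct)  # ['A']
print(nested)  # ['A', 'B', 'C']
```

All three authors are found, which matches the output above where every orig_author is printed but only the innermost text appears.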
Answer 0 (score: 1)
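That happens because only the first text node in each element is stored as the element's text. A text node that comes right after a child element is stored as the tail of that child element instead. You can use the following logic to get all the direct child text nodes of a given parent element. It simply joins the first text node with the tail of every direct child element, if there are any: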
def get_text(element):
    # element.text is None when the element starts directly with a child element
    return (element.text or '') + \
        ''.join(c.tail for c in element.findall('*') if c.tail is not None)
Quick test:
>>> for i in range(0,len(root)):
...     #retrieve data from post
...     if root[i].tag == "post":
...         #recursively retrieve quote
...         quotes = root[i].findall('.//quote')
...         for quote in quotes:
...             print(quote.get("orig_author"))
...             print(get_text(quote))
...
Luddly Neddite
It's being reported that of the 21 people reportedly advising Jeb Bush, 19 are veterans of the first Bush administration, the second Bush administration, or in a few cases, both. Some of the more notable names are Secretary of State (James Baker), his brother’s Deputy Defense Secretary (Paul Wolfowitz), his brother’s National Security Adviser (Stephen Hadley), a variety of members from his brother’s cabinet (Tom Ridge and Michael Chertoff).
zeke
I think the Repubs have a brilliant strategy by running Bush 3. And Clinton 2. It will allow the hyper partisans on both sides to make the decision as to who will be president. Because people like me will just say fuck it to voting. If these two represent the very best that America has to offer in the form of leadership, we are royally and truly fucked. And I am done voting. Not that my vote means much anyway.
Luddly Neddite
The Bushes have middle east oil money behind them. They are owned by such as the bin Laden's and Saudi Prince Alwaleed bin Talal. That's in addition to the Koch/Adelson openly buying elections.
occupied
Don't forget the fucking Moonies.
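To see where each piece of text ends up, here is a small self-contained sketch. The SAMPLE string is a made-up two-level quote, not the original file, and this get_text variant is an assumption-laden rewrite of the helper above that also tolerates a missing text and strips surrounding whitespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level quote, written without whitespace so that
# the text/tail split is easy to see.
SAMPLE = (
    '<quote orig_author="outer">'
    '<quote orig_author="inner">inner text</quote>'
    'outer text'
    '</quote>'
)

outer = ET.fromstring(SAMPLE)
inner = outer[0]

print(repr(outer.text))  # None: nothing precedes the first child
print(repr(inner.text))  # 'inner text'
print(repr(inner.tail))  # 'outer text': stored on the child, not the parent


def get_text(element):
    """Join an element's own text with the tails of its direct children."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


print(get_text(outer))  # outer text
print(get_text(inner))  # inner text
```

Because the outer quote's own words follow the nested quote, they are attached to the inner element as tail, which is why printing quote.text alone shows only the innermost quote.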