I'm having trouble extracting text with ElementTree.
My XML file has the following format:
<elecs id = 'elecs'>
<elec id = "CLM-0001" num = "0001">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
<elec id = "CLM-0002" num = "0002">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
</elecs>
I want to extract all the text inside the tags.
Assume the path to our XML file is stored in the variable xml:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(recover = True)
contents = open(xml).read()
tree = ET.fromstring(contents, parser = parser)
elecsN = tree.find('elecs')
for element in elecsN:
    print(element.text)
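The problem is that the code above prints empty strings. I have run the same code against other tags in my document and it works there, so I don't understand why it returns empty strings this time.
Is there any way I can solve this?
Thanks a lot.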
Answer 0 (score: 1)
You can look up by name the elements that directly contain the text, for example elec-text:
>>> elec_texts = tree.findall('.//elec-text')
>>> for elec_text in elec_texts:
...     print(elec_text.text)
...
blah blah blah
blah blah blah
blah blah blah
blah blah blah
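A likely reason for the blank output in the question: each elec element contains only child elements, so its own .text is just the whitespace before the first child. Also, if tree is already the root elecs element, tree.find('elecs') looks for a child named elecs and finds none. Here is a minimal Python 3 sketch using only the standard library, with the sample XML inlined; the variable names are illustrative and not from the original post:

```python
import xml.etree.ElementTree as ET

xml_text = """<elecs id = 'elecs'>
<elec id = "CLM-0001" num = "0001">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
<elec id = "CLM-0002" num = "0002">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
</elecs>"""

# fromstring returns the root element, which here is <elecs> itself.
root = ET.fromstring(xml_text)

# Searching the root for a child called 'elecs' finds nothing.
missing = root.find('elecs')

# Each <elec> holds only child elements, so its own .text is whitespace.
own_texts = [elec.text for elec in root]

# Option 1: target the leaf elements that actually carry the text.
leaf_texts = [e.text.strip() for e in root.iter('elec-text')]

# Option 2: itertext() walks every text node below the root;
# whitespace-only nodes are filtered out.
all_texts = [t.strip() for t in root.itertext() if t.strip()]

print(repr(missing))
print(repr(own_texts))
print(leaf_texts)
print(all_texts)
```

Option 1 is the safer choice when other tags in the document also carry text that you do not want; option 2 is handy when you want everything regardless of tag name.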
Answer 1 (score: 0)
If by "any way" you really mean any approach, you can use lxml and an XPath query.
>>> from io import StringIO
>>> html = StringIO('''\
... <elecs id = 'elecs'>
... <elec id = "CLM-0001" num = "0001">
... <elec-text> blah blah blah </elec-text>
... <elec-text> blah blah blah </elec-text>
... </elec>
... <elec id = "CLM-0002" num = "0002">
... <elec-text> blah blah blah </elec-text>
... <elec-text> blah blah blah </elec-text>
... </elec>
... </elecs>
... '''
... )
>>> from lxml import etree
>>> doc = etree.parse(html)
>>> doc.xpath('//elecs/elec/*/text()')
[' blah blah blah ', ' blah blah blah ', ' blah blah blah ', ' blah blah blah ']
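If lxml is not installed, the limited path syntax supported by the standard library's ElementTree can get a similar result. This is only a rough sketch, assuming Python 3 and that every direct child of elec carries text; flat_texts and texts_by_id are names chosen for illustration:

```python
import xml.etree.ElementTree as ET

xml_text = """<elecs id = 'elecs'>
<elec id = "CLM-0001" num = "0001">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
<elec id = "CLM-0002" num = "0002">
<elec-text> blah blah blah </elec-text>
<elec-text> blah blah blah </elec-text>
</elec>
</elecs>"""

root = ET.fromstring(xml_text)  # root is the <elecs> element

# Roughly mirrors the XPath //elecs/elec/*/text():
# every direct child of every <elec>, with its text left unstripped.
flat_texts = [child.text for child in root.findall('./elec/*')]

# Group the stripped text by the id attribute of each <elec>.
texts_by_id = {
    elec.get('id'): [child.text.strip() for child in elec]
    for elec in root.findall('elec')
}

print(flat_texts)
print(texts_by_id)
```

Grouping by id keeps track of which elec each string came from, which the flat list loses.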