I'm trying to teach myself how to parse XML. I've read the lxml tutorial, but I find it hard to follow. So far, I can do this:
>>> from lxml import etree
>>> xml=etree.parse('ham.xml')
>>> xml
<lxml.etree._ElementTree object at 0x118de60>
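But how do I get data out of this object? It can't be indexed like xml[0], and it can't be iterated over.
More specifically, I'm using this XML file, and I'm trying to extract everything between the <l> tags that are contained in <sp> tags carrying the { {1}} attribute.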
Answer 0 (score: 2)
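You can also take a look at the lxml API documentation, specifically the lxml.etree._Element page. It documents every attribute and method of that class you might want to know about.
I would start by reading the lxml.etree tutorial, however.
If an element cannot be indexed, though, it is an empty tag and has no child nodes to retrieve.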
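On the indexing part of the question: etree.parse() returns an _ElementTree wrapper rather than an element, and it is the root element returned by getroot() that supports indexing, len() and iteration. A minimal sketch of this, using a tiny made-up inline document in place of ham.xml (the sample XML and its contents are assumptions for illustration only):

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for ham.xml, kept tiny so the example is self-contained.
SAMPLE = b"""<play>
  <sp who="Barnardo"><l>Who's there?</l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
</play>"""

tree = etree.parse(BytesIO(SAMPLE))  # an _ElementTree, like `xml` in the question
root = tree.getroot()                # the <play> element

first_sp = root[0]                          # elements can be indexed like lists
speakers = [sp.get("who") for sp in root]   # iterating yields the direct children
lines = [l.text for l in root.iter("l")]    # iter() walks the whole subtree

print(root.tag, len(root))
print(first_sp.get("who"), speakers)
print(lines)
```

Note that this sample has no namespace, so plain tag names work here; the TEI file in the question declares one, which is what the rest of this answer deals with.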
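To find all lines by Bernardo, you need an XPath expression with a namespace map. It doesn't matter which prefix you use, as long as it is a non-empty string; lxml maps it to the correct namespace URL: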
# tree is the parsed document, e.g. tree = etree.parse('ham.xml')
nsmap = {'s': 'http://www.tei-c.org/ns/1.0'}
for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
    print(line.strip())
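This extracts all the text in the <l> elements contained in <sp who="Barnardo"> tags. Note the s: prefix on the tag names; the nsmap dictionary tells lxml which namespace to use. I strip the surrounding whitespace before printing.
For your example document, this gives: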
>>> for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
...     print(line.strip())
...
Who's there?
Long live the king!
He.
'Tis now struck twelve; get thee to bed, Francisco.
Have you had quiet guard?
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
Say,
What, is Horatio there?
Welcome, Horatio: welcome, good Marcellus.
I have seen nothing.
Sit down awhile;
And let us once again assail your ears,
That are so fortified against our story
What we have two nights seen.
Last night of all,
When yond same star that's westward from the pole
Had made his course to illume that part of heaven
Where now it burns, Marcellus and myself,
The bell then beating one,
In the same figure, like the king that's dead.
Looks 'a not like the king? mark it, Horatio.
It would be spoke to.
See, it stalks away!
How now, Horatio! you tremble and look pale:
Is not this something more than fantasy?
What think you on't?
I think it be no other but e'en so:
Well may it sort that this portentous figure
Comes armed through our watch; so like the king
That was and is the question of these wars.
'Tis here!
It was about to speak, when the cock crew.
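To try the namespace handling without the full play, here is a minimal sketch on a made-up fragment that declares the same TEI default namespace. The fragment and the helper name lines_by are assumptions for illustration, not part of ham.xml or of lxml. It also shows that the prefix is arbitrary: s and tei select the same nodes as long as both map to the namespace URL.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment with the same default namespace as the TEI file.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="Barnardo"><l> Who's there? </l></sp>
  <sp who="Francisco"><l>Nay, answer me.</l></sp>
  <sp who="Barnardo"><l>Long live the king!</l></sp>
</TEI>"""

root = etree.fromstring(SAMPLE)


def lines_by(speaker, prefix):
    """Return the stripped <l> texts of one speaker (illustrative helper)."""
    query = './/{p}:sp[@who=$who]/{p}:l/text()'.format(p=prefix)
    # Extra keyword arguments to xpath() are passed in as XPath variables ($who).
    matches = root.xpath(query, namespaces={prefix: TEI_NS}, who=speaker)
    return [text.strip() for text in matches]


with_s = lines_by("Barnardo", "s")
with_tei = lines_by("Barnardo", "tei")
no_prefix = root.xpath('.//sp')  # unprefixed names do not match namespaced elements

print(with_s)
print(with_tei == with_s, no_prefix)
```

Passing the speaker as an XPath variable is a design choice here; it avoids building the quoted attribute value into the expression by hand.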
Answer 1 (score: 1)
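One way to parse XML is to use XPath. You can call the xpath() member function of the ElementTree, xml.
For example, to print the XML of all <l> elements (the lines of the play):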
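subtrees = xml.xpath('//prefix:l', namespaces={'prefix': 'http://www.tei-c.org/ns/1.0'})
for l in subtrees:
    print(etree.tostring(l))
The lxml docs describe the XPath functionality in detail.
Note that this does not work unless the namespace is specified, which is why the expression uses the prefix: a plain '//l' matches nothing in this file. Unfortunately, lxml does not support an empty prefix for the default namespace in XPath. You can, however, change the root node to bind the namespace to a prefix named prefix, the same name used above, instead of declaring it as the default. The unprefixed elements then belong to no namespace, so '//l' matches them:
<TEI xmlns:prefix="http://www.tei-c.org/ns/1.0" xml:id="sha-ham">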
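The effect of the namespace can be checked on a small made-up fragment (an assumption standing in for the real file). In this sketch an unprefixed //l finds nothing while the element is in the TEI namespace, whereas a prefixed query and the {namespace}tag form accepted by iter() both find it.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment; the real file has many more elements.
root = etree.fromstring(
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    b'<sp who="Barnardo"><l>Who\'s there?</l></sp>'
    b'</TEI>'
)

unprefixed = root.xpath('//l')  # no match: <l> is in the TEI namespace
prefixed = root.xpath('//prefix:l', namespaces={'prefix': TEI_NS})
clark = list(root.iter('{%s}l' % TEI_NS))  # {namespace}tag form, no prefix map needed

for l in prefixed:
    # tostring() returns bytes; decode them for readable output
    print(etree.tostring(l).decode())
```

The {namespace}tag spelling can be convenient when only tag names are matched, since it needs no prefix mapping at all.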