Why am I not getting the text inside an XML tag? - python elementtree

Asked: 2012-09-22 06:56:27

Tags: python xml elementtree readxml

How can I read all the text inside the <context>...</context> tag? And what about the <head>...</head> tag inside the <context> tag?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

But when I run my code to read the XML text inside <context>...</context>, I only get the text up to the point where the <head> tag starts.

import xml.etree.ElementTree as et    
inputfile = "./coach.data"    
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j=1
for i in instances:
    print "instance " + str(j)
    print "left: " + i
    print "\n"  
    j+=1

Right now I only get the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a 

The output also needs the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

2 Answers:

Answer 0 (score: 2)

First, there is a mistake in your code. The loop "for corpus in root" is not needed: your root already is the <corpus> element.

What you probably want to do is:

contexts = []
for lexelt in root:
  for instance in lexelt:
    for context in instance:
      contexts.append(context.text)
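
To confirm that the root element is already <corpus>, it can help to inspect the tag names. The following is a minimal sketch in Python 3 syntax; it assumes an inline, shortened copy of the question's XML parsed with et.fromstring, so that it runs without ./coach.data:

```python
import xml.etree.ElementTree as et

# Inline, shortened copy of the question's XML (an assumption made so the
# sketch is self-contained; the question reads ./coach.data instead).
XML = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
    </lexelt>
</corpus>"""

root = et.fromstring(XML)

# The root element is <corpus> itself; iterating over it yields its
# direct children, which are the <lexelt> elements.
root_tag = root.tag
child_tags = [child.tag for child in root]
print(root_tag)    # corpus
print(child_tags)  # ['lexelt']
```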

Now, regarding your question: inside the "for context in instance" block, you can access the other two strings you need:

  1. The head text is available as context.find('head').text
  2. The text to the right of the head element is available as context.find('head').tail

According to the ElementTree documentation:

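    The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but can be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element's end tag and before the next tag.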
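
Putting the three pieces together might look like the sketch below. It is written in Python 3 syntax and parses an inline copy of the question's XML instead of ./coach.data. The helper name split_context is hypothetical, and the "or ''" fallbacks are an added assumption to cope with text or tail being None:

```python
import xml.etree.ElementTree as et

# Inline copy of the question's XML so the sketch runs without ./coach.data.
XML = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""


def split_context(context):
    """Return (left, head, right) for a <context> element (hypothetical helper)."""
    head = context.find('head')
    left = context.text or ''   # text before the first child element
    if head is None:            # no <head> child: everything counts as "left"
        return left, '', ''
    # head.text is the text inside <head>; head.tail is the text after </head>
    return left, head.text or '', head.tail or ''


root = et.fromstring(XML)
results = [split_context(context)
           for lexelt in root
           for instance in lexelt
           for context in instance]

for number, (left, head, right) in enumerate(results, start=1):
    print("instance %d" % number)
    print("left: " + left)
    print("head: " + head)
    print("right: " + right)
    print()
```

Note that the left and right strings keep the spaces next to the head word, because ElementTree returns text and tail exactly as they appear in the file.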

Answer 1 (score: 1)

In ElementTree you have to take the tail attribute of the child nodes into account. Also, in your case corpus is the root.


    import xml.etree.ElementTree as et
    inputfile = "./coach.data"
    corpus = et.parse(inputfile).getroot()

    def getalltext(elem):
        # text and tail are None when empty, so fall back to ''
        return (elem.text or '') + ''.join(
            getalltext(child) + (child.tail or '') for child in elem)

    instances = []
    for lexelt in corpus:
        for instance in lexelt:
            instances.append(getalltext(instance))


    j = 1
    for i in instances:
        print("instance " + str(j))
        # i holds the full text of the instance, not just the left side
        print("text: " + i)
        print("\n")
        j += 1
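
If only the concatenated text is needed, Element.itertext() (available in Python 2.7 and 3.2+) yields the same text and tail pieces in document order, so the recursive helper could be replaced by a join. A minimal sketch in Python 3 syntax, assuming an inline copy of the XML rather than ./coach.data:

```python
import xml.etree.ElementTree as et

# Inline copy of the question's XML so the sketch runs without ./coach.data.
XML = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""

root = et.fromstring(XML)

# root.iter('context') walks the whole tree and yields every <context>
# element; itertext() returns its inner text, including the text and tail
# of nested elements such as <head>.
full_texts = [''.join(context.itertext()) for context in root.iter('context')]

for number, text in enumerate(full_texts, start=1):
    print("instance %d" % number)
    print("text: " + text)
```

Iterating over the <context> elements directly also avoids picking up the indentation whitespace that surrounds them inside each <instance>.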