Question

How do I read all the text inside the <context>...</context> tags? And what about the <head>...</head> element nested inside <context>?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

But when I run my code to read the XML text inside <context>...</context>, I only get the text up to the point where the <head> tag starts.

import xml.etree.ElementTree as et    
inputfile = "./coach.data"    
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j = 1
for i in instances:
    print("instance " + str(j))  # j is an int, so convert it before concatenating
    print("left: " + i)
    print("\n")
    j += 1

Right now I only get the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a

The output also needs the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

Answer 1

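First, there is a mistake in your code. The loop `for corpus in root` is not needed, because your root already is the `corpus` element.

What you probably meant to do is: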

contexts = []  # collects the text before <head> for each <context>
for lexelt in root:
  for instance in lexelt:
    for context in instance:
      contexts.append(context.text)
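
To see why the extra loop is unnecessary, it can help to print the tag of the root and of its children. This is a minimal sketch, assuming a trimmed copy of the question's XML is inlined as a string; the names `xml_data`, `root_tag` and `child_tags` are only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed copy of the question's XML, inlined for illustration.
xml_data = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
    </lexelt>
</corpus>"""

root = ET.fromstring(xml_data)

# The root element is <corpus> itself, not a wrapper around it.
root_tag = root.tag

# Iterating over an element yields its direct children, here <lexelt>.
child_tags = [child.tag for child in root]

print(root_tag, child_tags)
```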

Now, regarding your question: inside the `for context in instance` block you can access the other two strings you need.

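The text of `head` can be accessed with:

context.find('head').text

The text to the right of the `head` element can be read with context.find('head').tail. According to the Python ElementTree documentation:

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element's end tag and before the next tag.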
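
Putting the three pieces together, the left/head/right output from the question could be produced roughly as follows. This is only a sketch: the XML is inlined as a string instead of being read from `./coach.data`, `split_context` is a hypothetical helper name, and it assumes each `<context>` contains at most one `<head>`.

```python
import xml.etree.ElementTree as ET

# Assumption: the question's XML, inlined so the example is self-contained.
xml_data = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""


def split_context(context):
    """Return (left, head, right) for a <context> with at most one <head>."""
    left = context.text or ''          # text before the first child element
    head = context.find('head')
    if head is None:                   # no <head>: everything is "left"
        return left, '', ''
    # head.text is the text inside <head>; head.tail is the text after </head>.
    return left, head.text or '', head.tail or ''


root = ET.fromstring(xml_data)
results = [split_context(context) for context in root.iter('context')]

for number, (left, head, right) in enumerate(results, 1):
    print("instance " + str(number))
    print("left: " + left)
    print("head: " + head)
    print("right: " + right)
    print("")
```

The `or ''` fallbacks are there because `text` and `tail` are `None` when an element has no text in that position.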

Answer 2

In ElementTree you have to take the tail attribute of the child nodes into account. Also, in your case corpus is the root.


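    import xml.etree.ElementTree as et

    inputfile = "./coach.data"
    corpus = et.parse(inputfile).getroot()  # the root element is <corpus>

    def getalltext(elem):
        # text and tail are None when absent, so fall back to ''
        text = elem.text or ''
        for child in elem:
            text += getalltext(child) + (child.tail or '')
        return text

    instances = []
    for lexelt in corpus:
        for instance in lexelt:
            # strip the indentation whitespace around <context>
            instances.append(getalltext(instance).strip())

    for j, text in enumerate(instances, 1):
        print("instance " + str(j))
        print("text: " + text)  # full sentence, including the <head> word
        print("\n")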
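
As an alternative to the recursive helper, ElementTree also provides `Element.itertext()`, which yields an element's text, the text of its descendants and their tails in document order. The sketch below is not part of the original answer; it assumes a single `<context>` element inlined as a string, and `full_text` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumption: one <context> element from the question, inlined for illustration.
xml_data = "<context>I'll buy a train or <head>coach</head> ticket.</context>"

context = ET.fromstring(xml_data)

# itertext() walks the element and its children, yielding every text and
# tail fragment in document order; joining them gives the full sentence.
full_text = "".join(context.itertext())

print(full_text)
```

This gives the whole sentence but not the left/head/right split, so it fits cases where only the plain text of the context is needed.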

Why am I not getting the text inside XML tags? - python elementtree
