Question

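I currently have an XML file that I want to parse with Python. I am using Python's ElementTree, and it works fine except for one problem.

The file currently looks like this: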

<Instance>
  <TextContent>
    <Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>

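What I basically want to do is skip the nested tag (i.e. <Thing>) inside the <Sentence> tag. One way I found to do this is to take the text content up to the tag, then the text content of the tag itself, and concatenate them. The code I am using is: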

import xml.etree.ElementTree as ET


xtree = ET.parse('some_file.xml')
xroot = xtree.getroot()

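for node in xroot:  # assumes each node is an <Instance> child of the root
    text_before = node[0][0].text     # <Sentence> text up to the nested tag
    text_nested = node[0][0][0].text  # text inside the nested <Thing>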

How do I get the part of the text that comes after the nested tag?
Better yet, is there a way to ignore the nested tag entirely?

Thanks.

Answer 1

I changed your source XML file slightly, so that the sentence contains two child elements:

<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>

To find the Sentence element, run: st = xroot.find('.//Sentence').
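
As a minimal, self-contained sketch of that lookup: the sample XML is parsed here from an inline string with ET.fromstring instead of from some_file.xml, which is an assumption made only so that the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# The modified sample, inlined as a string (an assumption made so the
# snippet runs without some_file.xml).
xml_source = """<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>"""

xroot = ET.fromstring(xml_source)

# './/Sentence' searches the descendants of xroot at any depth and
# returns the first match (or None if there is none).
st = xroot.find('.//Sentence')

print(st.tag)   # Sentence
print(len(st))  # 2
```

With a file on disk, ET.parse('some_file.xml').getroot() from the question should give the same xroot.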

Then define the following generator:

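def allTextNodes(root):
    # Text between the opening tag of root and its first child.
    if root.text is not None:
        yield root.text
    for child in root:
        # Text that follows the closing tag of each direct child.
        if child.tail is not None:
            yield child.tail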
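
The generator works because ElementTree stores the text that follows an element's closing tag in that element's tail attribute, not on the parent. A small sketch on the one-child sentence from the question (the inline XML string and the variable names sentence and thing are illustrative assumptions) shows how to reach the part after the nested tag:

```python
import xml.etree.ElementTree as ET

# The <Sentence> element from the question, parsed from an inline string.
sentence = ET.fromstring(
    "<Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>"
)
thing = sentence[0]  # the nested <Thing> element

text_before = sentence.text  # text up to the first child
text_nested = thing.text     # text inside <Thing>
text_after = thing.tail      # text after </Thing>, stored on the child

print(repr(text_before))  # 'Hello, my name is John and his '
print(repr(text_nested))  # 'name'
print(repr(text_after))   # ' is Tom.'
```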

To see the list of all text nodes that are direct descendants, run:

lst = list(allTextNodes(st))

The result is:

['Hello, my ', ' is John and his ', ' is Tom.']

To get the concatenated text in a single variable, run:

txt = ''.join(allTextNodes(st))

This gives: Hello, my  is John and his  is Tom. (Note the double spaces: each Thing element is omitted, but the spaces on either side of it remain.)
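
If the double spaces are unwanted, or if the goal is to drop only the tags while keeping the nested text, two variations may help. This is a sketch rather than part of the original answer: it rebuilds the sample sentence from an inline string (an assumption), collapses whitespace with str.split(), and uses the standard Element.itertext() method.

```python
import xml.etree.ElementTree as ET

# The sample <Sentence>, inlined so the snippet runs on its own.
st = ET.fromstring(
    "<Sentence>Hello, my <Thing>name</Thing> is John "
    "and his <Thing>name</Thing> is Tom.</Sentence>"
)

# Variation 1: skip the nested elements' text, then collapse runs of
# whitespace into single spaces.
parts = [st.text or ''] + [child.tail or '' for child in st]
without_nested = ' '.join(''.join(parts).split())
print(without_nested)  # Hello, my is John and his is Tom.

# Variation 2: ignore only the tags. itertext() yields the text and tail
# strings of the element and all of its descendants in document order.
with_nested = ''.join(st.itertext())
print(with_nested)  # Hello, my name is John and his name is Tom.
```

Which variation fits depends on whether "ignoring the nested tag" means dropping its content or only its markup.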

Skipping "nested tags" when parsing XML with Python
