Question

I am working with XML files in Python. I have a dataset of sentences in multiple languages, structured as follows:

<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">...</text>
    <text lang="fr">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="fr">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>

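What I want is to derive from this dataset a new dataset that contains only the English sentences and annotations (those whose "lang" attribute has the value "en"). I tried the following solution: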

import xml.etree.ElementTree as ET
tree = ET.parse('samplefile2.xml')
root = tree.getroot()
for sentence in root:
  if sentence.tag == 'sentence':
    for txt in sentence:
      if txt.tag == 'text':
        if txt.attrib['lang'] != 'en':
          sentence.remove(txt)
      if txt.tag == 'annotations':
        for annotation in txt:
          if annotation.attrib['lang'] != 'en':
            txt.remove(annotation)
tree.write('output.xml')

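However, it seems to work only at the level of the text elements, not at the level of the annotation elements. I even tried replacing the loop variables sentence, txt, annotation with incremental indexing such as root[s], root[s][t], root[s][t][a], but that had no effect. In addition, the Python code I provided makes seemingly random insertions into the XML file (honestly, I don't know whether this is relevant to solving the problem), for example δημιουργία.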

So I strongly suspect the problem lies in the nested tags, but I can't figure it out. Any ideas?

Answer 1

If you are able to use lxml, I think it would be easier to use XPath...

XML input (input.xml)

<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">...</text>
    <text lang="fr">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="fr">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>

Python

from lxml import etree

target_lang = "en"

tree = etree.parse("input.xml")

# Match any element that has a child that has a lang attribute with a value other than
# target_lang. We need this element so we can remove the child from it.
for parent in tree.xpath(f".//*[*[@lang != '{target_lang}']]"):
    # Match the children that have a lang attribute with a value other than target_lang.
    for child in parent.xpath(f"*[@lang != '{target_lang}']"):
        # Remove the child from the parent.
        parent.remove(child)

tree.write("output.xml")

XML output (output.xml)

<corpus>
  <sentence id="0">
    <text lang="en">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="en">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>
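
If lxml is not an option, the same filtering can probably be done with the standard library alone. A likely reason the code in the question misses the annotations is that it removes children from an element while iterating over that same element, which makes the loop skip the siblings that follow each removed child (here, the annotations element). The sketch below is one possible fix, not the only one: it iterates over a copy of each child list. The function name keep_language and the inline sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def keep_language(root, target_lang):
    """Remove every element whose lang attribute differs from target_lang."""
    # Snapshot all elements first so removals cannot disturb the traversal.
    for parent in list(root.iter()):
        # list(parent) is a copy of the children; removing from the real
        # parent while looping over the copy does not skip any sibling.
        for child in list(parent):
            lang = child.get("lang")
            if lang is not None and lang != target_lang:
                parent.remove(child)


# Hypothetical sample mirroring the structure shown in the question.
sample = """<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">...</text>
    <text lang="fr">...</text>
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="fr">...</annotation>
    </annotations>
  </sentence>
</corpus>"""

root = ET.fromstring(sample)
keep_language(root, "en")
result = ET.tostring(root, encoding="unicode")
print(result)
```

When writing to a file with ElementTree, passing an explicit encoding, for example ET.ElementTree(root).write("output.xml", encoding="utf-8", xml_declaration=True), keeps non-ASCII text such as δημιουργία as literal characters; the default encoding is US-ASCII, which writes such characters as numeric character references. That may or may not be the insertion effect described in the question. Note also that the standard parser drops XML comments by default, so they will not appear in the output.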
