Question

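I am currently parsing a Wikipedia dump to extract some useful information. The dump is in XML format, and I only want the text/content of each page. What I can't figure out is how to find all the text inside a tag that is itself nested inside another tag. I searched for similar questions, but only found ones that deal with a single tag. Here is an example of what I want to achieve: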

  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>

  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </example_tag>

How can I extract the text inside the <text> tag, but only when it is contained in the <revision> tree?

Answer 1

You can use the xml.etree.ElementTree module with an XPath query (ElementTree supports a limited subset of XPath):

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(Here the_xml_string is a string containing the XML code.)
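To make this concrete, here is a minimal self-contained sketch based on the sample from the question. It assumes the two fragments are wrapped in a single root element (called page here purely for illustration), because ET.fromstring expects exactly one top-level element. The helper name revision_texts is also made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample from the question, shortened and wrapped in a hypothetical <page>
# root element, because ET.fromstring needs a single top-level element.
SAMPLE = """
<page>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <example_tag>
    <text>This text is not inside a revision.</text>
  </example_tag>
</page>
"""


def revision_texts(xml_string):
    """Return the text of every <text> that is a direct child of a <revision>."""
    root = ET.fromstring(xml_string)
    return [node.text for node in root.findall('.//revision/text')]


print(revision_texts(SAMPLE))  # ['A bunch of [[text]] here.']
```

The <text> inside <example_tag> is skipped because the path only matches <text> elements whose parent is a <revision>.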

Or build a list of the text contents with a list comprehension:

import xml.etree.ElementTree as ET

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

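So .text holds the inner text of each matched element. Note that you must replace othertag with the actual tag name (for example, text). If that tag can be nested arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.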
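The difference between the two queries can be sketched as follows. The document here is hypothetical: the <page> root and the <body> wrapper are made up only to place one <text> deeper inside <revision>.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <text> is a direct child of <revision>,
# the other sits one level deeper inside a made-up <body> element.
NESTED = """
<page>
  <revision>
    <text>direct child</text>
    <body>
      <text>nested deeper</text>
    </body>
  </revision>
</page>
"""

root = ET.fromstring(NESTED)

# Single slash: only <text> elements that are direct children of <revision>.
direct = [node.text for node in root.findall('.//revision/text')]

# Double slash: <text> elements at any depth below <revision>.
any_depth = [node.text for node in root.findall('.//revision//text')]

print(direct)     # ['direct child']
print(any_depth)  # ['direct child', 'nested deeper']
```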

Python: Extracting text from inside a tag in an XML tree
