Question

I have the following XML file.

<?xml version="1.0" encoding="UTF-8"?><searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
    <snippet>
      this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
    <snippet>
            this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   ........ #and other documents .....

  <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
  <group id="0" size="298" score="55">
      <title>
         <phrase>GROUP B</phrase>
      </title>
      <document refid="2"/>
      <document refid="13"/>
      <document refid="3"/>
   </group>
</searching>

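I want to get the group names above, along with the document IDs (and their titles) in each group. My idea is to store the document IDs and document titles in a dictionary: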

import codecs
documentID = {}    
group = {}

myfile = codecs.open("file.xml", mode = 'r', encoding = "utf8")
for line in myfile:
    line = line.strip()
    #get id from tags
    #get title from tag
    #store in documentID 


    #get group name and document reference

I have also tried BeautifulSoup, but I am very new to it and don't know how to proceed. This is the code I am working on.

def outputCluster(rFile):
    documentInReadFile = {}         #dictionary to store all document in readFile

    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in readFile:
    # print soup.prettify()

    # print soup.find_all('title')

outputCluster("file.xml")

Please give me some advice. Thank you.

Answer 1

Have you looked at Python's XML etree parser? There are plenty of examples online.

Answer 2

The previous poster is right. The etree documentation can be found here:

https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

It may help you. Here is a code sample that might work (partly taken from the link above):

import xml.etree.ElementTree as ET
tree = ET.parse('your_file.xml')
root = tree.getroot()

for group in root.findall('group'):
  title = group.find('title')
  titlephrase = title.find('phrase').text
  for doc in group.findall('document'):
    refid = doc.get('refid')

Alternatively, if you want the ID stored on the group tag itself, you can use id = group.get('id') instead of searching for all the refids.
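
To get closer to what the question asks for (group names plus the ID and title of each document in the group), the loop above can be combined with a lookup dictionary. This is only a minimal sketch, assuming Python 3 and a file shaped like the one in the question. SAMPLE_XML is a shortened, hypothetical copy of that file, and parse_groups is a helper name made up for this example. Groups are keyed by their phrase text because the sample in the question reuses id="0" for both groups.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the XML in the question.
SAMPLE_XML = """<searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
      <url>http://www.google.com/</url>
   </document>
   <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
</searching>"""


def parse_groups(xml_text):
    """Return (titles, groups) for XML shaped like the question's file."""
    root = ET.fromstring(xml_text)

    # findall('document') only matches direct children of the root,
    # so the <document refid=.../> entries inside groups are skipped here.
    titles = {}
    for doc in root.findall('document'):
        titles[doc.get('id')] = doc.findtext('title', default='').strip()

    # Each group lists its members as <document refid="..."/>.
    groups = {}
    for group in root.findall('group'):
        name = group.findtext('title/phrase', default='').strip()
        members = []
        for ref in group.findall('document'):
            refid = ref.get('refid')
            # titles.get gives None when the referenced document is missing.
            members.append((refid, titles.get(refid)))
        groups[name] = members
    return titles, groups


titles, groups = parse_groups(SAMPLE_XML)
for name, members in groups.items():
    print(name)
    for refid, title in members:
        print('  ', refid, title)
```

For a file on disk, ET.parse('your_file.xml').getroot() could replace the ET.fromstring call.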

Answer 3

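ElementTree is great for navigating XML. If you go to the documentation, it shows you how to manipulate XML in many ways, including how to get the contents of a tag. One example from the documentation:
XML: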

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Code:

>>> for country in root.findall('country'):
...   rank = country.find('rank').text
...   name = country.get('name')
...   print name, rank
...
Liechtenstein 1
Singapore 4
Panama 68

You can easily adapt this to do what you want.
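
The documentation snippet above assumes that root already exists and uses the Python 2 print statement. Here is a minimal self-contained sketch of the same idea for Python 3, assuming the XML is held in a string. COUNTRY_XML is a name made up for this example, and the data is a shortened version of the sample above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the documentation sample, for illustration only.
COUNTRY_XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(COUNTRY_XML)

ranks = {}
for country in root.findall('country'):
    name = country.get('name')          # attribute value
    rank = country.find('rank').text    # text inside the child tag
    ranks[name] = rank
    print(name, rank)
```

The same two calls, get() for attributes and find(...).text for tag contents, are the ones needed for the id attribute and the title tag in the question's file.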

Answer 4

BeautifulSoup is nice to use, though a little surprising at first.

soup = BeautifulSoup(myfile)

soup becomes the whole file, and then you have to search it to find the part you need, for example:

group = soup.find(name="group", attrs={'id': '0', 'size': '298'})

group now contains the group tag and its contents (the first matching group it finds):

<group>blabla its contents<tag inside it>blabla</tag inside it>etc.</group>

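Do this several times to get down to the lowest-level tag. The more specific you are, the less chance there is of landing on the wrong tag. Then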

lastthingyoufound.find(name='phrase')

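will contain your answer. The result still includes the tags, so you need another function to get just the text, depending on your bs version. Use findAll (find_all in BeautifulSoup 4) to build a list you can iterate over when looking for multiple elements, and keep references to the earlier tags so you can look up other information later. Avoid doing soup = soup.find(...), because then you are only looking for one specific thing and lose the tags in between. That is the same as doing soup = find(...).find(...).findall(...)[-1].find(...)['id'].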

How do I get the value of an XML tag in Python?
