Problem parsing XML with Python

Date: 2014-11-27 08:46:42

Tags: python xml parsing

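I am new to XML and REST, but I have some basic knowledge of Python. I have run into a problem while trying to parse the XML file below.

I am using the BeautifulSoup library to parse the file and, for some unknown reason, I can access the various fields of entries 2 and 3 but not those of entry 1, even though all three entries have the same format. Can someone tell me what I am doing wrong, given the code and output below?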

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title type="text">News</title>
    <id>1</id>
    <link href="" />
    <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/entries" rel="self" />
    <updated>2014-11-26T10:41:12.424Z</updated>
    <author />
    <entry xmlns:georss="http://www.georss.org/georss">
        <title type="html">TEST REST</title>
        <content type="html">1</content>
        <author>
            <name>User213</name>
        </author>
        <summary type="html">Test PUT Entry 3</summary>
        <id>7</id>
        <georss:point>21.94420760726878 17.44</georss:point>
        <updated>2014-11-24T09:55:31.000Z</updated>
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/7" rel="self" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/7/editEntry" rel="edit" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/7/comments" rel="replies" type="application/atom+xml" length="0" />
    </entry>
    <entry xmlns:georss="http://www.georss.org/georss">
        <title type="html">TEST REST</title>
        <content type="html">1</content>
        <author>
            <name>User213</name>
        </author>
        <summary type="html">Test PUT Entry 8</summary>
        <id>8</id>
        <georss:point>21.94420760726878 17.44</georss:point>
        <updated>2014-11-24T13:47:09.000Z</updated>
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/8" rel="self" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/8/editEntry" rel="edit" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/8/comments" rel="replies" type="application/atom+xml" length="0" />
    </entry>
    <entry xmlns:georss="http://www.georss.org/georss">
        <title type="html">TEST REST</title>
        <content type="html">1</content>
        <author>
            <name>User213</name>
        </author>
        <summary type="html">Test POST</summary>
        <id>12</id>
        <georss:point>21.94420760726878 17.44</georss:point>
        <updated>2014-11-25T14:29:02.000Z</updated>
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/12" rel="self" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/12/editEntry" rel="edit" type="application/atom+xml" length="0" />
        <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/12/comments" rel="replies" type="application/atom+xml" length="0" />
    </entry>
</feed>

Python code:

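#!/usr/bin/python
# Python 2 script using BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

# Read the whole XML file into a string
handler = open("/tmp/test.xml").read()

# Build the parse tree from the file contents
soup = BeautifulSoup(handler)

# Print every <entry> element followed by a few of its child fields
results = soup.findAll('entry')
for r in results:
    print r
    print r.find('title').text
    print r.find('content').text
    print r.find('georss:point')
    print r.find('id')
    print r.find('updated')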

The output is as follows:

<entry xmlns:georss="http://www.georss.org/georss">
<title type="html">TEST REST</title>
<content type="html">1</content>
</entry>
TEST REST
1
None
None
None
<entry xmlns:georss="http://www.georss.org/georss">
<title type="html">TEST REST</title>
<content type="html">1</content>
<author>
<name>User213</name>
</author>
<summary type="html">Test PUT Entry 8</summary>
<id>8</id>
<georss:point>21.94420760726878 17.44</georss:point>
<updated>2014-11-24T13:47:09.000Z</updated>
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/8" rel="self" type="application/atom+xml" length="0" />
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/8/editEntry" rel="edit" type="application/atom+xml" length="0" />
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/8/comments" rel="replies" type="application/atom+xml" length="0" />
</entry>
TEST REST
1
<georss:point>21.94420760726878 17.44</georss:point>
<id>8</id>
<updated>2014-11-24T13:47:09.000Z</updated>
<entry xmlns:georss="http://www.georss.org/georss">
<title type="html">TEST REST</title>
<content type="html">1</content>
<author>
<name>User213</name>
</author>
<summary type="html">Test POST</summary>
<id>12</id>
<georss:point>21.94420760726878 17.44</georss:point>
<updated>2014-11-25T14:29:02.000Z</updated>
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/12" rel="self" type="application/atom+xml" length="0" />
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/12/editEntry" rel="edit" type="application/atom+xml" length="0" />
<link href="http://192.168.20.223:8083/myWebApp/rest/listOfEntries/1/12/comments" rel="replies" type="application/atom+xml" length="0" />
</entry>
TEST REST
1
<georss:point>21.94420760726878 17.44</georss:point>
<id>12</id>
<updated>2014-11-25T14:29:02.000Z</updated>

1 answer:

Answer 0 (score: 1)

I tested it with the following code:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
handler = open("./test.xml").read()

soup = BeautifulSoup(handler)
print soup.prettify()

The output looks like this:

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
 <title type="text">
  News
 </title>
 <id>
  1
 </id>
 <link href="" />
 <link href="http://192.168.1.12:8083/myWebApp/rest/listOfEntries/1/entries" rel="self" />
 <updated>
  2014-11-26T10:41:12.424Z
 </updated>
 <author>
  <entry xmlns:georss="http://www.georss.org/georss">
   <title type="html">
    TEST REST
   </title>
   <content type="html">
    1
   </content>
  </entry>
 </author>
 <author>
  <name>
   User213
  </name>
 </author>

If you look closely, you will see that BeautifulSoup treats the <author /> element in your XML as an opening tag rather than as an empty element.

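That is why you cannot find the remaining fields of the first entry (id, georss:point, updated): as far as BeautifulSoup is concerned, they are not inside that entry tag.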
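
One way to sidestep this, assuming you are free to use a parser other than BeautifulSoup 3, is the standard library's xml.etree.ElementTree, which is XML-aware and reads <author /> as an empty element. The following is only a sketch and not something I ran against your full file: the helper name parse_entries, the prefix mapping NS, and the trimmed SAMPLE feed are illustrative choices, while the namespace URIs come from your XML.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the feed in the question; the prefixes
# "atom" and "georss" are local aliases chosen for this sketch.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Trimmed-down sample of the feed (illustrative), keeping the empty
# <author /> element that confuses BeautifulSoup 3.
SAMPLE = """<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title type="text">News</title>
    <id>1</id>
    <author />
    <entry xmlns:georss="http://www.georss.org/georss">
        <title type="html">TEST REST</title>
        <content type="html">1</content>
        <author>
            <name>User213</name>
        </author>
        <id>7</id>
        <georss:point>21.94420760726878 17.44</georss:point>
        <updated>2014-11-24T09:55:31.000Z</updated>
    </entry>
    <entry xmlns:georss="http://www.georss.org/georss">
        <title type="html">TEST REST</title>
        <content type="html">1</content>
        <author>
            <name>User213</name>
        </author>
        <id>8</id>
        <georss:point>21.94420760726878 17.44</georss:point>
        <updated>2014-11-24T13:47:09.000Z</updated>
    </entry>
</feed>
"""


def parse_entries(xml_bytes):
    """Return one dict per <entry>, with the fields printed in the question."""
    root = ET.fromstring(xml_bytes)
    entries = []
    # Every Atom element lives in the default namespace, so each path
    # needs the "atom:" prefix; the point element uses "georss:".
    for entry in root.findall("atom:entry", NS):
        entries.append({
            "title": entry.findtext("atom:title", namespaces=NS),
            "content": entry.findtext("atom:content", namespaces=NS),
            "point": entry.findtext("georss:point", namespaces=NS),
            "id": entry.findtext("atom:id", namespaces=NS),
            "updated": entry.findtext("atom:updated", namespaces=NS),
        })
    return entries


entries = parse_entries(SAMPLE.encode("utf-8"))
for e in entries:
    print(e)
```

To use it on the real file, read the file in binary mode and pass the bytes to parse_entries. Note that this sketch uses print as a function, so it targets Python 3 rather than the Python 2 used above.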

Hope this helps.