Question

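Python and programming newbie here, so you may have to bear with me. I have a number of XML files (RSS archives) from which I want to extract news article URLs. I am using Python 2.7.3 on Windows... here is a sample of the XML I am looking at: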

<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!-- 
Content-type: Preventing XSRF in IE.

 -->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>

Specifically, I want to extract the "original-id" link:

<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>

I initially tried BeautifulSoup but ran into problems, and from my research it looks like ElementTree is the way to go. My first attempt with ET was:

import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()

#first_original_id = root[8][0]

parents_of_interest = root[8::]

for elem in parents_of_interest:
    print elem.items()[0][1]

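As far as I can work out, parents_of_interest does grab the data I want (as a list of dictionaries), but the for loop only returns a bunch of "true" values, and after reading the documentation it seems this is the wrong approach anyway.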

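I think this linked answer has what I am looking for, but even though it is a good explanation, I can't seem to apply it to my own situation. Based on that answer I tried: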

print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text

But I got this error:

__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely
 on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

Any help with this would be appreciated... and sorry if this is a long-winded question... but I thought I would detail everything... just in case.

Answer 1

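Your XPath expression does not match the id you are looking for: there is a stray } after entry where a / separator belongs, and the id tag is missing its namespace, so find returns None. Also, original-id is an attribute of the element rather than its text, so you should write something like this: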

idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')

This only finds the first matching id. If you want all of them, use findall and iterate over the results.
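A minimal sketch of the findall variant, assuming every article sits in an Atom entry whose id carries the gr:original-id attribute. The SAMPLE_FEED string, the example.com URLs, and the extract_original_ids helper are made up for illustration and are not part of the original feed or of ElementTree itself.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'
READER_NS = 'http://www.google.com/schemas/reader/atom/'

# Hypothetical, trimmed-down feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <id>tag:google.com,2005:reader/feed/example</id>
  <entry>
    <id gr:original-id="http://example.com/story-one.html">tag:example:1</id>
  </entry>
  <entry>
    <id gr:original-id="http://example.com/story-two.html">tag:example:2</id>
  </entry>
  <entry>
    <id>tag:example:3</id>
  </entry>
</feed>"""


def extract_original_ids(root):
    """Collect gr:original-id from the <id> child of every <entry>."""
    # Both the element path and the attribute name must be namespace-qualified.
    id_path = './{%s}entry/{%s}id' % (ATOM_NS, ATOM_NS)
    attr_name = '{%s}original-id' % READER_NS
    urls = []
    for id_elem in root.findall(id_path):
        url = id_elem.get(attr_name)
        if url is not None:  # skip ids that lack the attribute
            urls.append(url)
    return urls


root = ET.fromstring(SAMPLE_FEED)
original_ids = extract_original_ids(root)
for url in original_ids:
    print(url)
```

For a file on disk, the same helper should work with extract_original_ids(ET.parse('thefile.xml').getroot()). The feed-level id is skipped because the path only descends through entry elements.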

Unable to parse an XML archive with ElementTree
