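Python and programming newbie here, so you may have to bear with me. I have a number of XML files (RSS archives) from which I want to extract news article URLs. I am using Python 2.7.3 on Windows. Here is a sample of the XML I am looking at: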
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!--
Content-type: Preventing XSRF in IE.
-->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>
Specifically, I want to extract the "original-id" link:
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
I originally tried BeautifulSoup but ran into problems, and from my research it looks like ElementTree is the way to go. First, I tried this with ET:
import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()
#first_original_id = root[8][0]
parents_of_interest = root[8::]
for elem in parents_of_interest:
    print elem.items()[0][1]
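As far as I can work out, parents_of_interest does grab the data I want (as a list of dictionaries), but the for loop only returns a bunch of true values, and after reading the documentation it seems this is the wrong approach.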
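I think this other answer has what I am looking for, but even though it is a good explanation, I can't seem to apply it to my own situation. Based on that answer I tried: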
print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text
But I got this error:
__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely
on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
Any help with this would be much appreciated. Sorry if this is a long-winded question, but I thought I would spell everything out, just in case.
Answer 0 (score: 0)
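Your XPath expression is malformed: the stray } after entry makes it look for a single tag named entry}id, which matches nothing, so find returns None. Even with that fixed, a search for id alone would match the feed-level id first, not the one inside entry. Also, original-id is an attribute of the element rather than its text, so you should write something like this: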
idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')
This will only find the first matching id; if you want all of them, use findall and iterate over the results.
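As a rough sketch of the findall variant, something like the following should work. It parses a trimmed-down stand-in for the feed from the question instead of a file; the second entry (the example.com one) and the helper name original_ids are made up for illustration and are not part of the original feed or of ElementTree. The single-argument print(...) form is used so the snippet runs unchanged on Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}tag" notation.
ATOM = '{http://www.w3.org/2005/Atom}'
GR = '{http://www.google.com/schemas/reader/atom/}'

# Trimmed-down stand-in for the feed in the question.
# The second <entry> is hypothetical, added only to show iteration.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <id>tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml</id>
  <entry gr:is-read-state-locked="true">
    <id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
  </entry>
  <entry>
    <id gr:original-id="http://example.com/second-article.html">tag:example.com,2013:item/2</id>
  </entry>
</feed>"""


def original_ids(root):
    """Collect gr:original-id from every <id> that sits directly inside an <entry>."""
    urls = []
    # The path is relative to root (<feed>), so the feed-level <id> is not matched.
    for id_elem in root.findall(ATOM + 'entry/' + ATOM + 'id'):
        url = id_elem.get(GR + 'original-id')
        if url is not None:  # skip any <id> that lacks the attribute
            urls.append(url)
    return urls


root = ET.fromstring(SAMPLE)
for url in original_ids(root):
    print(url)
```

For a real file, replace the ET.fromstring(SAMPLE) line with root = ET.parse('thefile.xml').getroot(). Incidentally, the loop in the question printed true because elem.items()[0][1] returns the value of one of the entry's own attributes (here gr:is-read-state-locked="true"), not anything from the nested id element.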