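I am new to Python and to programming in general, so please forgive me if this question is a silly one.
I have been following this tutorial on RSS scraping, but when I try to collect the links that correspond to the titles of the articles being gathered, Python raises a "list index out of range" error.
Here is my code: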
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')

find_title = re.findall(title, source)
find_link = re.findall(link, source)

literate = []
literate[:] = range(1, 16)

for i in literate:
    print find_title[i]
    print find_link[i]
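It runs fine when I only ask it to retrieve the titles, but it throws the index error as soon as I try to retrieve both the titles and their corresponding links.
Any help is much appreciated.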
Answer 0 (score: 6)
You can use the feedparser module to parse an RSS feed from a given URL:
#!/usr/bin/env python
import feedparser  # pip install feedparser

d = feedparser.parse('http://feeds.huffingtonpost.com/huffingtonpost/latestnews')
# .. skipped handling http errors, caching ..

for e in d.entries:
    print(e.title)
    print(e.link)
    print(e.description)
    print("\n")  # 2 newlines
Even Critics Of Safety Net Increasingly Depend On It
http://www.huffingtonpost.com/2012/02/12/safety-net-benefits_n_1271867.html
<p>Ki Gulbranson owns a logo apparel shop, deals in
<!-- ... snip ... -->
Christopher Cain, Atlanta Anti-Gay Attack Suspect, Arrested And
Charged With Aggravated Assault And Robbery
http://www.huffingtonpost.com/2012/02/12/atlanta-anti-gay-suspect-christopher-cain-arrested_n_1271811.html
<p>ATLANTA -- Atlanta police have arrested a suspect
<!-- ... snip ... -->
Using regular expressions to parse RSS (XML) is probably not a good idea.
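If installing feedparser is not an option, the standard library's XML parser avoids the regex approach as well. The following is only a minimal sketch, assuming a plain RSS 2.0 layout with item elements that each contain a title and a link; SAMPLE_RSS and extract_items are made-up names for illustration, and the sample data is not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First article</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def extract_items(xml_text):
    """Return a list of (title, link) tuples, one per <item> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter('item'):
        # findtext returns None when the child element is missing,
        # so a title can never be paired with another item's link.
        pairs.append((item.findtext('title'), item.findtext('link')))
    return pairs


pairs = extract_items(SAMPLE_RSS)
for title, link in pairs:
    print(title)
    print(link)
```

Because the title and link are read from the same item element, the channel-level title and link are skipped and the two values always stay aligned.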
Answer 1 (score: 1)
I think you are using the wrong regular expression to extract the links from the page.
>>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
>>> find_link = re.findall(link, source)
>>> find_link[1].strip()
'"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
>>> len(find_link)
15
>>>
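If you look at the HTML source of your page, you will find that the links are not wrapped in a <link></link> pattern.
The actual pattern is <link rel="alternate" type="text/html" href= links here.
That is why your regex does not work: it finds no links at all, so find_link is an empty list and indexing into it raises the "list index out of range" error.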
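To make the mismatch concrete, here is a small sketch that runs both patterns against a made-up fragment shaped like the markup described above. SAMPLE and the variable names are assumptions for illustration, and the real feed may contain other link variants. It also pairs titles and links with zip instead of a fixed index range, so the loop stops at the shorter list rather than raising an IndexError.

```python
import re

# Hypothetical fragment imitating the feed's markup; not the real feed.
SAMPLE = '''<entry>
<title>First article</title>
<link rel="alternate" type="text/html" href="http://example.com/first" />
</entry>
<entry>
<title>Second article</title>
<link rel="alternate" type="text/html" href="http://example.com/second" />
</entry>'''

title_re = re.compile(r'<title>(.*?)</title>')
# The question's pattern: expects <link>...</link>, which never appears here.
old_link_re = re.compile(r'<link>(.*?)</link>')
# Captures only the URL between the quotes of the href attribute.
href_re = re.compile(r'<link rel="alternate" type="text/html" href="([^"]*)"')

titles = title_re.findall(SAMPLE)
old_links = old_link_re.findall(SAMPLE)
links = href_re.findall(SAMPLE)

# The old pattern yields an empty list, so any index into it fails.
print(len(titles), len(old_links), len(links))

# zip stops at the shorter list, so no index can run out of range.
for title, link in zip(titles, links):
    print(title)
    print(link)
```

Capturing inside the quotes also avoids the trailing `" />` that shows up in the interactive session above.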