Question

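I use feedparser to fetch RSS feed data. It works fine for most RSS feeds. However, I have now stumbled upon a website where fetching the RSS feed fails (example feed). The returned result does not contain the expected keys, and the values are some HTML code.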

I tried reading the feed URL using urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed error. If I add custom headers, such as

headers = {
    'Content-type' : 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}

request = urllib2.Request(url, headers=headers)

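I no longer get the 405 error, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same is true when I look at "View Page Source". feedparser.parse also allows setting agent and request_headers, and I tried various agents. I am still unable to read the XML correctly, let alone get the parsed feed from feedparser.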

What am I missing here?

Answer 1

So, this feed yields a 405 error when the client making the request does not send a User-Agent. Try this:

$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding

Without a UA, you get:

$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
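
The same check can be sketched in Python. This is only a minimal sketch under a few assumptions: it uses Python 3's urllib.request rather than the urllib2 from the question, and the names build_feed_request, fetch_feed and DEFAULT_USER_AGENT (and the agent string itself) are made up for illustration. Whether this particular server keeps accepting an arbitrary agent string is not guaranteed.

```python
import urllib.request

# Hypothetical agent string; the curl test above suggests the server only
# requires that some User-Agent header is present.
DEFAULT_USER_AGENT = "feed-reader-sketch/0.1"


def build_feed_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request that explicitly carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_feed(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Return the raw response body as bytes.

    urlopen raises urllib.error.HTTPError for responses such as
    405 Not Allowed, so a rejected request fails loudly here.
    """
    request = build_feed_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_feed('http://www.propertyguru.com.sg/rss') could then be passed to feedparser.parse. Alternatively, as the question already notes, feedparser.parse accepts an agent argument, so the agent string can be handed to it directly.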

Python / Feedparser: reading RSS feed fails
