Question

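I am iterating through the Wikipedia page for every date (January 1, January 2, ..., December 31). On each page, I record the names of the people whose birthday falls on that day. However, partway through my run (at April 27), I get this warning: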

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Then, right afterwards, I get an error:

Traceback (most recent call last):
    File "wikipedia.py", line 29, in <module>
        section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'

Basically, I can't figure out why, after getting all the way to April 27, it decides to throw this warning and error. Here is the April 27 page:

April 27...

As far as I can tell, nothing on that page is different in a way that would cause this. There is still a span with id="Births".

Here is the code where I call everything:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site,headers=hdr)    
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    section = soup.find('span', id='Births').parent
    births = section.find_next('ul').find_all('li')

    for x in births:
        #All the regex and parsing, don't think it's necessary to show

The error occurs on this line:

section = soup.find('span', id='Births').parent

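By April 27 I have accumulated a lot of data (about 35,000 elements in each list), but I don't think that would be the problem. If anyone has any ideas, I'd appreciate it. Thanks.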

Answer 1

It looks like the Wikipedia server is serving that page gzipped:

>>> page.info().get('Content-Encoding')
'gzip'

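It shouldn't do that when your request carries no Accept-Encoding header, but such is life when working with other people's servers. Beautiful Soup is being handed compressed bytes rather than HTML, which is why it warns about undecodable characters and then finds no span with id="Births".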

There are plenty of sources showing how to work with gzip-compressed data. Here is one: http://www.diveintopython.net/http_web_services/gzip_compression.html

Here is another: Does python urllib2 automatically uncompress gzip data fetched from webpage?
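As a rough illustration of what those sources describe, here is a minimal sketch that decompresses the body only when the server says it is gzipped. It is written for Python 3 (where urllib2 became urllib.request), and `decode_body` is a hypothetical helper name chosen for this example, not something provided by urllib or Beautiful Soup.

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Return the response body, gunzipping it if the server compressed it.

    `decode_body` is a hypothetical helper name used for illustration only.
    """
    if content_encoding and content_encoding.lower() == "gzip":
        # GzipFile reads from any file-like object, so wrap the raw bytes.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
            return f.read()
    # No gzip encoding reported: pass the bytes through unchanged.
    return raw


# Assumed usage with the request from the question (Python 3 spelling):
#   import urllib.request
#   req = urllib.request.Request(site, headers=hdr)
#   page = urllib.request.urlopen(req)
#   html = decode_body(page.read(), page.info().get("Content-Encoding"))
#   soup = BeautifulSoup(html, "html.parser")
```

Checking the header rather than always decompressing keeps the other date pages working, since most of them apparently arrive uncompressed.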
