Why isn't BeautifulSoup accepting unicode input?

Asked: 2014-03-03 01:59:41

Tags: python unicode beautifulsoup python-requests

I am using requests to download a web page and BS4 to parse it. It works for some links, but sometimes it gives me the following error:

  

Expected a bytes object, not a unicode object

The Beautiful Soup documentation says that unicode is fine:

  

If you pass in a bytestring, Beautiful Soup will assume the string is encoded as UTF-8. You can get around this by passing in a Unicode string instead.
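To make the distinction in that quote concrete, here is a minimal sketch that uses only the standard library, not Beautiful Soup itself. It shows why guessing UTF-8 for a bytestring can go wrong and why decoding to Unicode first removes the guess. The sample markup and the names `original`, `raw_bytes`, `utf8_ok` and `decoded` are made up for illustration.

```python
# Hypothetical sample: markup that was encoded as ISO-8859-1 (Latin-1), not UTF-8.
original = u"<p>caf\u00e9</p>"
raw_bytes = original.encode("latin-1")

# Treating these bytes as UTF-8 fails: the byte 0xE9 starts a multi-byte
# sequence in UTF-8, but the byte that follows is not a continuation byte.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding that was actually used yields a Unicode string,
# which can then be handed to a parser without any further guessing.
decoded = raw_bytes.decode("latin-1")

print(utf8_ok)
print(decoded == original)
```

As far as I understand, `requests.get(link).text` is already a Unicode string of this kind, which is why I expected it to be accepted.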

Here is my code:

from bs4 import BeautifulSoup
import requests

link = "http://www.nytimes.com/2014/03/03/world/europe/ukraine.html?hpw&rref=world"

def get_article(link):
    # .text is the response body decoded to a Unicode string
    p = requests.get(link).text
    print(p)
    soup = BeautifulSoup(p)
    # Select the paragraphs that make up the article body
    paragraphs = soup.find_all('p', class_="story-body-text story-content")
    print(paragraphs)
    return paragraphs

def get_text(paragraphs):
    # Concatenate the text of every paragraph into one string
    text = ""
    for paragraph in paragraphs:
        text += paragraph.text
    return text

print(get_text(get_article(link)))

The link given in the code raises the error; this link does not: http://www.nytimes.com/2014/03/02/world/asia/afghan-broadcaster-says-us-soldiers-abused-him.html?ref=world

Traceback:

Message File Name       Line    Position       
Traceback                              
        C:\Users\Acer Customer\Desktop\Project\nytscraper.py    27             
    get_article C:\Users\Acer Customer\Desktop\Project\nytscraper.py    17             
    get C:\Python27\lib\site-packages\requests\api.py   55             
    request     C:\Python27\lib\site-packages\requests\api.py   44             
    request     C:\Python27\lib\site-packages\requests\sessions.py      383            
    send        C:\Python27\lib\site-packages\requests\sessions.py      506            
    resolve_redirects   C:\Python27\lib\site-packages\requests\sessions.py      168            
    send        C:\Python27\lib\site-packages\requests\sessions.py      486            
    send        C:\Python27\lib\site-packages\requests\adapters.py      378            
ConnectionError: HTTPConnectionPool(host='www.nytimes.com', port=80): Max retries exceeded with url: /2014/03/03/world/europe/pressure-rising-as-obama-works-to-rein-in-russia.html?_r=0 (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)
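This traceback ends in a ConnectionError raised inside requests while following a redirect, so it may be a network failure rather than an encoding problem. One thing that might be worth trying is wrapping the download in a small retry loop. The sketch below is only an assumption about how that could look: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fake fetch function stands in for the real call so the example runs without network access.

```python
import time

def fetch_with_retries(fetch, retries=3, delay=1.0, errors=(IOError,)):
    """Call fetch() until it succeeds or `retries` attempts have failed."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except errors as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < retries - 1:
                time.sleep(delay)
    # Every attempt failed: re-raise the most recent error.
    raise last_error

# Hypothetical stand-in for the real download: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("connection reset")
    return u"<html>ok</html>"

html = fetch_with_retries(flaky_fetch, retries=5, delay=0)
print(html)
print(calls["count"])
```

With requests, the call would presumably be something like `fetch_with_retries(lambda: requests.get(link).text, errors=(requests.exceptions.ConnectionError,))`. Whether retrying actually helps with this particular URL is untested.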

0 answers:

No answers