I'm using requests to download a web page and BS4 to parse it. It finds some links, but sometimes it gives me the following error:

Expected a bytes object, not a unicode object

The Beautiful Soup documentation says unicode is fine:

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

Here is my code:
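(A minimal sketch of the bytes/unicode distinction the quoted docs describe; the strings here are my own example, not from the page being scraped. Note that `requests.get(link).text` already returns unicode, while `.content` would return bytes.)

```python
# The docs quoted above: a byte string is assumed to be UTF-8, while a
# unicode string skips that guessing step entirely.
html_bytes = b"<p>caf\xc3\xa9</p>"       # UTF-8-encoded bytes
html_text = html_bytes.decode("utf-8")   # explicit decode -> unicode text
```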
from bs4 import BeautifulSoup
import requests

link = "http://www.nytimes.com/2014/03/03/world/europe/ukraine.html?hpw&rref=world"

def get_article(link):
    p = requests.get(link).text
    print p
    soup = BeautifulSoup(p)
    paragraphs = soup.find_all('p', class_="story-body-text story-content")
    print paragraphs
    return paragraphs

def get_text(paragraphs):
    text = ""
    for paragraph in paragraphs:
        text += paragraph.text
    return text

print get_text(get_article(link))
The link given in the code raises the error; this link does not: http://www.nytimes.com/2014/03/02/world/asia/afghan-broadcaster-says-us-soldiers-abused-him.html?ref=world
Traceback:

Message                File Name                                              Line
Traceback              C:\Users\Acer Customer\Desktop\Project\nytscraper.py   27
get_article            C:\Users\Acer Customer\Desktop\Project\nytscraper.py   17
get                    C:\Python27\lib\site-packages\requests\api.py          55
request                C:\Python27\lib\site-packages\requests\api.py          44
request                C:\Python27\lib\site-packages\requests\sessions.py     383
send                   C:\Python27\lib\site-packages\requests\sessions.py     506
resolve_redirects      C:\Python27\lib\site-packages\requests\sessions.py     168
send                   C:\Python27\lib\site-packages\requests\sessions.py     486
send                   C:\Python27\lib\site-packages\requests\adapters.py     378

ConnectionError: HTTPConnectionPool(host='www.nytimes.com', port=80): Max retries exceeded with url: /2014/03/03/world/europe/pressure-rising-as-obama-works-to-rein-in-russia.html?_r=0 (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)