Error with requests while using Pandas / BeautifulSoup: requests.exceptions.TooManyRedirects: Exceeded 30 redirects

Date: 2018-01-22 19:01:33

Tags: python pandas web-scraping beautifulsoup python-requests

I am using Python 3 to scrape pages from a Pandas DataFrame that I built from a csv file containing 63,067 source URLs. The for loop is supposed to scrape the news article from each entry and append it to a huge text file for cleaning later.

I am a bit rusty with Python, and this project is the reason I started programming again. I had not used BeautifulSoup before, so I struggled just to get the for loop to run BeautifulSoup over the Pandas DataFrame.

This is one of the three datasets I am using (the other two are wired into the code below, which repeats the same process for each dataset, which is why I mention it).



from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd

negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')

negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)


negativeURLS = negativedf[['sourceURL']]

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text

    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href = True):
        text.append((text.get('href')))




I think I finally got my for loop working so that the code runs through all of the source URLs. However, I then get this error:



Traceback (most recent call last):
  File "./datacollection.py", line 18, in <module>
    negative = requests.get(url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

I know the problem occurs partway through iterating over the URLs, but given the number of pages in the DataFrame I am not sure whether a specific URL is the culprit, or whether I simply have too many and should be using a different package such as scrapy.

1 answer:

Answer 0: (score: 0)

I would suggest using a module like mechanize for scraping. Mechanize has a way of handling robots.txt, which is better if your application is scraping data from URLs across different websites. But in your case, the redirects are probably happening because the request headers do not include a User-Agent, as mentioned here (https://github.com/requests/requests/issues/3596). Here is how you can set headers with requests (Sending "User-agent" using Requests library in Python).
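As a minimal sketch of that suggestion: attaching a browser-like User-Agent to a requests Session sends the header on every subsequent request, which is often enough to stop redirect loops triggered by the default python-requests agent string. The User-Agent value below is illustrative, not a specific recommendation.

```python
import requests

# Many sites redirect (or loop) requests that arrive with the default
# "python-requests/x.y" User-Agent, so set a browser-like one once on a
# Session; it is then sent with every request made through that Session.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; news-scraper)"})

# session.get(url) would now include the header automatically,
# e.g. inside the for loop: negative = session.get(url)
```

A Session also reuses the underlying TCP connection, which helps when fetching tens of thousands of pages.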

P.S.: mechanize is only available for Python 2.x. If you want to use Python 3.x, there are other options (Installing mechanize for python 3.4).
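Separately, with ~63,000 URLs a few are bound to misbehave no matter what headers you send, so it can help to catch TooManyRedirects and skip those URLs rather than let one bad link crash the whole loop. A hedged sketch (the `fetch_text` helper and its parameters are my own, not from the question):

```python
import requests

def fetch_text(url, session=None):
    """Return the page body for url, or None if it keeps redirecting.

    `session` defaults to the requests module itself; passing a
    requests.Session() lets the caller reuse connections and headers.
    """
    getter = session or requests
    try:
        response = getter.get(url, timeout=10)
        return response.text
    except requests.exceptions.TooManyRedirects:
        # This URL redirects more than max_redirects times; skip it.
        return None
```

In the original loop this would replace the bare `requests.get(url)` call, with a simple `if negative_content is None: continue` guard before the BeautifulSoup step.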