I have a script that takes a URL and returns the value of the page's <title>
tag. After a few hundred or so runs, I always get the same error:
  File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
    status, response = http.request(pageurl)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
    raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.
My function is as follows:
def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    y = x[7:-8]
    z = y.split('-')[0]
    return z
Pretty straightforward. I have tried wrapping the request in try/except with a time.sleep(1) to give it time to recover, in case that was the problem, but so far nothing has worked. And I don't want to just pass on the error. Maybe the site is rate-limiting me?
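For what it's worth, httplib2's Http.request also accepts a redirections keyword (default 5) if the redirect limit itself needs raising. The try/except idea can also be sharpened into a generic retry with exponential backoff that re-raises once the attempts run out, rather than passing silently. A minimal sketch, with a placeholder exception standing in for httplib2.RedirectLimit:

```python
import time

def retry_on_error(func, exceptions, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a listed exception, back off exponentially and retry.

    Re-raises on the final attempt instead of silently passing."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))
```

With the original function this would be called as retry_on_error(lambda: get_title(url), (httplib2.RedirectLimit,)); the attempt count and delays here are illustrative.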
Edit: As of right now the script doesn't work at all; it hits the error on the very first request.
I have a json file of more than 80,000 www.wikiart.org painting pages. For each one I run my function to get the title. So:
print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))
returns
"Van Gogh's Chair"
Answer 0 (score: 2)
Try using the Requests library. As far as I can tell, it doesn't run into the rate limiting you're seeing. I was able to retrieve 13 titles in 21.6s. See below:
Code:
import requests as rq
from bs4 import BeautifulSoup as bsoup

def get_title(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    for url in urls:
        get_title(url)

if __name__ == "__main__":
    main()
Output:
Tiger in a Tropical Storm (Surprised!)
The Green Dancer
Dandelions
The Little Owl
Farmhouse with Birch Trees
Boxer
Three Women
Flower
Ruby
Musical Instruments
The evening gown
Lizard
The Girl with a Pearl Earring
[Finished in 21.6s]
That said, as a matter of personal ethics I wouldn't recommend doing it this way. With a fast connection you will pull data very quickly, and letting the scraper sleep a few seconds every 20 pages or so wouldn't hurt either.
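The sleep-every-so-many-pages suggestion can be sketched as a small generator that pauses after each batch (the every/pause values here are illustrative, and sleep is injectable so it can be tested without waiting):

```python
import time

def throttled(pages, every=20, pause=3.0, sleep=time.sleep):
    """Yield pages unchanged, sleeping `pause` seconds after every `every` pages."""
    for i, page in enumerate(pages, 1):
        yield page
        if i % every == 0:
            sleep(pause)
```

Wrapping the URL list as for url in throttled(urls): get_title(url) keeps the scraping loop itself unchanged.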
Edit: An even faster version, using grequests, which allows asynchronous requests. This pulls the same data in 2.6 seconds, nearly a 10x speedup. Again, throttle your scraping out of respect for the site.
import grequests as grq
from bs4 import BeautifulSoup as bsoup

def get_title(response):
    soup = bsoup(response.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    rs = (grq.get(u) for u in urls)
    for i in grq.map(rs):
        get_title(i)

if __name__ == "__main__":
    main()
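Note that grequests depends on gevent; if pulling that in is a problem, the standard library's concurrent.futures gives a similar overlap of network waits with plain threads. A sketch with a stand-in fetch function (a real version would call requests.get and parse the <title> as above):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_title(url):
    # Stand-in: returns the URL's last path segment. A real version would
    # fetch the page with requests.get(url) and parse it with BeautifulSoup.
    return url.rsplit("/", 1)[-1]

def fetch_all(urls, workers=8):
    # Threads overlap the network waits, much as grequests does with gevent.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_title, urls))
```

pool.map preserves input order, so the titles come back matched to their URLs.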