I have a script that takes a URL and returns the value of the page's <title>
tag. After a few hundred or so runs, I always get the same error:
  File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
    status, response = http.request(pageurl)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
    raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.
My function is as follows:
def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    y = x[7:-8]
    z = y.split('-')[0]
    return z
Pretty straightforward. I have tried wrapping the request in try/except with a time.sleep(1) to give it time to recover, in case that was the problem, but so far nothing has worked. And I don't want to just pass on the error. Maybe the site is rate-limiting me?
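For what it's worth, httplib2's Http.request also accepts a redirections keyword (default 5) if the redirect limit itself needs raising. The try/except idea can also be sharpened into a generic retry with exponential backoff that re-raises once the attempts run out, rather than passing silently. A minimal sketch, with a placeholder exception standing in for httplib2.RedirectLimit:

```python
import time

def retry_on_error(func, exceptions, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a listed exception, back off exponentially and retry.

    Re-raises on the final attempt instead of silently passing."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))
```

With the original function this would be called as retry_on_error(lambda: get_title(url), (httplib2.RedirectLimit,)); the attempt count and delays here are illustrative.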
Edit: As of right now the script doesn't work at all; it hits the error on the very first request.
I have a json file of more than 80,000 www.wikiart.org painting pages. For each one I run my function to get the title. So:
print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))
returns
"Van Gogh's Chair"
Answer 0 (score: 2)
Try using the Requests library. As far as I can tell, it doesn't run into the rate limiting you're seeing. I was able to retrieve 13 titles in 21.6s. See below:
Code:
import requests as rq
from bs4 import BeautifulSoup as bsoup

def get_title(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    for url in urls:
        get_title(url)

if __name__ == "__main__":
    main()
Output:
Tiger in a Tropical Storm (Surprised!)
The Green Dancer
Dandelions
The Little Owl
Farmhouse with Birch Trees
Boxer
Three Women
Flower
Ruby
Musical Instruments
The evening gown
Lizard
The Girl with a Pearl Earring
[Finished in 21.6s]
That said, as a matter of personal ethics I wouldn't recommend doing it this way. With a fast connection you will pull data very quickly, and letting the scraper sleep a few seconds every 20 pages or so wouldn't hurt either.
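The sleep-every-so-many-pages suggestion can be sketched as a small generator that pauses after each batch (the every/pause values here are illustrative, and sleep is injectable so it can be tested without waiting):

```python
import time

def throttled(pages, every=20, pause=3.0, sleep=time.sleep):
    """Yield pages unchanged, sleeping `pause` seconds after every `every` pages."""
    for i, page in enumerate(pages, 1):
        yield page
        if i % every == 0:
            sleep(pause)
```

Wrapping the URL list as for url in throttled(urls): get_title(url) keeps the scraping loop itself unchanged.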
Edit: An even faster version, using grequests, which allows asynchronous requests. This pulls the same data in 2.6 seconds, nearly a 10x speedup. Again, throttle your scraping out of respect for the site.
import grequests as grq
from bs4 import BeautifulSoup as bsoup

def get_title(response):
    soup = bsoup(response.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    rs = (grq.get(u) for u in urls)
    for i in grq.map(rs):
        get_title(i)

if __name__ == "__main__":
    main()
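Note that grequests depends on gevent; if pulling that in is a problem, the standard library's concurrent.futures gives a similar overlap of network waits with plain threads. A sketch with a stand-in fetch function (a real version would call requests.get and parse the <title> as above):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_title(url):
    # Stand-in: returns the URL's last path segment. A real version would
    # fetch the page with requests.get(url) and parse it with BeautifulSoup.
    return url.rsplit("/", 1)[-1]

def fetch_all(urls, workers=8):
    # Threads overlap the network waits, much as grequests does with gevent.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_title, urls))
```

pool.map preserves input order, so the titles come back matched to their URLs.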