Question

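I wrote some code that loops through a list of URLs, opens each one with urllib.request, and parses it with BeautifulSoup. The only problem is that the list is long (about 5000 URLs), and the code runs successfully for about 200 URLs before hanging indefinitely. Is there a way to either a) skip to the next URL after a set time, say 30 seconds, or b) retry opening the URL a set number of times before moving on to the next item?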

from bs4 import BeautifulSoup
import csv
import urllib.request
with open('csv_file.csv', 'r') as f:
  reader = csv.reader(f)
  urls_list = list(reader)
  for j in range(0, len(urls_list)):
    url= ''.join(urls_list[j])
    id=url[-10:].replace(".html","")

    from urllib.request import Request, urlopen
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    s = urlopen(req).read()
    soup = BeautifulSoup(s, "lxml")

Any suggestions would be much appreciated!

Answer 1

The docs (Python 2) say:

The urllib2 module defines the following functions: urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]]) Open the URL url, which can be either a string or a Request object.

urllib.request.urlopen in Python 3 accepts the same timeout argument, so adjust your code like this:

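import socket
from urllib.error import URLError  # HTTPError is a subclass of URLError

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=10).read()   # give up after 10 seconds
except (URLError, socket.timeout) as e:
    # URLError covers HTTP errors and connect timeouts; socket.timeout covers a stalled read
    print(str(e))  # print error detail (this may not be a timeout after all!)
    continue   # skip to next element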
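
For part b) of the question (retrying a few times before moving on), here is a minimal sketch rather than a definitive implementation. The function name fetch_with_retries and its parameters (retries, timeout, delay, opener) are made up for illustration and are not part of urllib. The opener parameter exists only so that the function can be exercised without a network connection. The sketch assumes that an HTTP error status such as 404 is not worth retrying, which may not suit every site.

```python
import socket
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_with_retries(url, retries=3, timeout=30, delay=1.0, opener=urlopen):
    """Return the response body as bytes, or None if every attempt fails.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(1, retries + 1):
        try:
            # timeout bounds each blocking socket operation (connect, read)
            with opener(req, timeout=timeout) as response:
                return response.read()
        except HTTPError as e:
            # The server answered with an error status; assume a retry
            # will not help and give up on this URL straight away.
            print('{}: HTTP {}'.format(url, e.code))
            return None
        except (URLError, socket.timeout, ConnectionError) as e:
            # Network-level failure or timeout: wait briefly, then retry.
            print('{}: attempt {}/{} failed: {}'.format(url, attempt, retries, e))
            if attempt < retries:
                time.sleep(delay)
    return None
```

Inside the loop you would then call s = fetch_with_retries(url) and continue to the next URL when s is None. Keep in mind that timeout limits each individual socket operation rather than the whole download, so a server that keeps trickling data can still take longer than the value you pass.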
