Question

I am trying to find the unreachable URLs in a list of URLs. The code is as follows:

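import re
import threading
import urllib.request

def sanity(url, errors):
    global count
    count += 1  # not thread-safe; only used for rough progress output
    if count % 1000 == 0:
        print(count)
    try:
        # Relative media paths are served from the S3 bucket
        if 'media' in url[:10]:
            url = "http://edoola.s3.amazonaws.com" + url
        headers = {
            'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
        }
        req = urllib.request.Request(url, headers=headers)
        ret = urllib.request.urlopen(req)
        return 1
    except Exception as e:
        # Any failure marks the URL as unreachable
        print(e, url)
        errors.append(url)
        return 0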

limit = 1000
count = 0
errors = []
with open('file.csv', 'r', encoding="utf-8") as file:
    text = file.read()
    # Quotes are doubled ("") because the HTML sits inside CSV fields
    urls = re.findall(r'<img.*?src=""(.*?)""[\s>]', text, flags=re.DOTALL)

# Process the URLs in chunks of `limit`, one thread per URL
for start in range(0, len(urls), limit):
    threads = [threading.Thread(target=sanity, args=(url, errors))
               for url in urls[start:start + limit]]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

print(errors)
with open('errors_urls.txt','w') as file:
     file.write('\n'.join(errors))

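The code works fine for the first 1000 URLs, but in the next thousand it prints URLs that I can open in my Chrome browser. I have looked into this and others. I tried those approaches in an IPython terminal on specific URLs and they worked fine. But when I use the same approach in the code above, URLs that are actually accessible still get reported as errors. How can I fix this?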
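To narrow down why a particular URL ends up in the error list, it may help to separate HTTP error statuses from network-level failures instead of catching everything in one branch. The following is only a rough sketch: the `classify` helper, its labels, the shortened User-Agent, and the 10-second timeout are assumptions made for illustration and are not part of the code above.

```python
import urllib.error
import urllib.request

def classify(url, timeout=10, opener=urllib.request.urlopen):
    """Return a short label describing how the request for `url` ended.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with opener(req, timeout=timeout) as resp:
            return f"ok {resp.status}"
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (403, 404, 503, ...)
        return f"http {e.code}"
    except urllib.error.URLError as e:
        # No usable answer: DNS failure, refused connection, and similar
        return f"network {e.reason}"
    except ValueError as e:
        # Malformed URL, e.g. a relative path without a scheme
        return f"invalid {e}"
    except OSError as e:
        # Timeouts and other socket-level errors raised outside URLError
        return f"os {e}"
```

Calling `classify(url)` on a few of the URLs that open in Chrome but fail in the script would show whether the server is rejecting the requests or the connection itself is failing.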

There are about 15,000 URLs, so in the code above I process them in chunks of 1000, which spawns 1000 threads at a time.
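For comparison, the same chunked check could be written with a bounded worker pool rather than 1000 threads per chunk. This is a minimal sketch under assumptions: the names `is_reachable` and `find_unreachable`, the pool size of 32, the 10-second timeout, and the shortened User-Agent are all hypothetical choices, and whether a smaller pool changes the behavior described above is untested.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def is_reachable(url, timeout=10):
    """Return True if the URL can be opened without an error."""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        # The with-block closes the response, releasing the connection
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError
        return False

def find_unreachable(urls, check=is_reachable, max_workers=32):
    """Run `check` over `urls` with at most `max_workers` threads at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        results = list(pool.map(check, urls))
    return [url for url, ok in zip(urls, results) if not ok]
```

With this shape, `errors = find_unreachable(urls)` would replace the manual chunking loop, and each worker returns its own result instead of appending to a shared list.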

Thanks for your help!

Unexpected behavior of urlopen in urllib.request (Python 3)

0 answers: