I am using requests to resolve the URLs in roughly 410K check-in records. However, the process has been running for hours and I am not sure where the problem is. I previously did the same thing for 1.7 million records and it worked fine. Here is my code:
import re
import time
import pickle
import requests
import multiprocessing

pat = re.compile(r"(?P<url>https?://[^\s]+)")   # always compile it

def resolve_url(text):
    url = 'before'
    long_url = 'after'
    error = 'none'
    match = pat.search(text)
    if match:
        url = match.group("url")
        try:
            # follow redirects and keep the final URL
            long_url = requests.head(url, allow_redirects=True).url
        except requests.exceptions.RequestException as e:
            error = e
    return (url, long_url, error)

# text_with_url (an iterable of texts) and t0 (the start time) are set up
# earlier in my script
pool = multiprocessing.Pool(200)
resolved_urls = []
for i, res in enumerate(pool.imap(resolve_url, text_with_url)):
    resolved_urls.append(res)
    if i % 10000 == 0 and i > 0:
        print("%d elements have been processed, %2.5f seconds" % (i + 1, time.time() - t0))
        # checkpoint every 10,000 results, then start a fresh batch
        fout = open("./yangj/resolved_urls_%d_requests.pkl" % (i + 1), "wb")
        pickle.dump(resolved_urls, fout)
        fout.close()
        resolved_urls = []

fout = open("./yangj/resolved_urls_last_requests.pkl", "wb")
pickle.dump(resolved_urls, fout)
fout.close()
I am wondering whether the problem is caused by some exceptions that I would need to write recovery code for. I have looked through the requests documentation and similar earlier questions, but I could not find a matching answer. Any ideas on how to fix this?
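For what it is worth, the only change I have thought of trying so far is passing an explicit per-request timeout, so that a single unresponsive host cannot keep a worker waiting indefinitely. Below is a minimal sketch of that variant; it reuses pat from above, the 10-second value and the name resolve_url_with_timeout are just placeholders I made up, and I have not confirmed that this is the actual cause of the slowdown:

def resolve_url_with_timeout(text, timeout=10.0):
    # Same logic as resolve_url above, but with a timeout that applies to
    # the connection attempt and to each read, so a slow or silent host
    # raises requests.exceptions.Timeout instead of blocking the worker.
    url = 'before'
    long_url = 'after'
    error = 'none'
    match = pat.search(text)
    if match:
        url = match.group("url")
        try:
            long_url = requests.head(url, allow_redirects=True, timeout=timeout).url
        except requests.exceptions.Timeout as e:
            error = e   # slow host: give up rather than wait forever
        except requests.exceptions.RequestException as e:
            error = e   # any other requests error (connection, too many redirects, ...)
    return (url, long_url, error)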