Question

Background (not essential to the core question):

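I am building a web scraper for 7,500+ requests. I put the URLs in a queue, and if the result for a URL contains an error, I put that URL back on the queue. The error may be a network error (I use a free proxy for every request) or a Lua script error. So once the code has run through all 7,500+ URLs, it comes back to the first URL that failed, and so on. I want to set the maximum number of retries to 5, so that the process runs as before but, for a URL that keeps returning an error, the loop ends after 5 retries.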

Core question:

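How can I implement a queue so that an item is put back on the queue when its task fails, and this is repeated only a fixed number of times? If the item still fails after that, it should be dropped.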

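I have written the code below, and so far it works for a small number of links. But before I run it on 7,500 URLs, I need to make sure it cannot get stuck in an infinite loop.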

from queue import Empty, Queue


def main(url_list):
    final = []
    scraped_url_success = set()

    to_scrape = Queue()
    for url in url_list:
        to_scrape.put(url)

    while True:
        try:
            # With a single thread, Empty here means every URL has been handled.
            target_url = to_scrape.get(timeout=5)
            if target_url not in scraped_url_success:
                result = get_result(target_url)  # my scraping function, defined elsewhere
                if 'error' in result.keys():
                    # This is where I am putting it back on the Queue
                    to_scrape.put(target_url)
                else:
                    scraped_url_success.add(target_url)
                    final.append(result)
        except Empty:
            return final
        except Exception as e:
            print(e)
            continue

The only idea I have so far is to keep a count in a dictionary, as follows:

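from queue import Empty, Queue


def main(url_list):
    counter = {}  # NEW CODE: attempt number per URL
    final = []
    scraped_url_success = set()

    to_scrape = Queue()
    for url in url_list:
        to_scrape.put(url)

    while True:
        try:
            target_url = to_scrape.get(timeout=5)
            if target_url not in scraped_url_success:
                # Key on the variable, not the string 'target_url', and set the
                # count only on first sight so it is not reset on every retry.
                counter.setdefault(target_url, 1)
                result = get_result(target_url)
                if 'error' in result.keys():
                    if counter[target_url] < 6:
                        to_scrape.put(target_url)
                        counter[target_url] += 1
                    else:
                        continue  # retries used up; drop this URL
                else:
                    scraped_url_success.add(target_url)
                    final.append(result)
        except Empty:
            return final
        except Exception as e:
            print(e)
            continue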

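Are there any better suggestions? My idea looks weak. Also, do I really need a queue here? In general, if I am not writing multithreaded code, does using a queue add any value (in the case of web scraping)?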
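One alternative I can imagine, as a rough sketch rather than a tested solution: since the code is single-threaded, carry the retry count along with each URL in a plain `collections.deque` instead of keeping a separate dictionary. The names `scrape_all`, `fetch`, and `MAX_RETRIES` are made up for illustration; `fetch` stands in for my `get_result`, and the sketch assumes `url_list` contains no duplicates.

```python
from collections import deque

MAX_RETRIES = 5  # hypothetical constant: retries allowed after the first attempt


def scrape_all(url_list, fetch, max_retries=MAX_RETRIES):
    """Return (results, failed_urls); `fetch` is a stand-in for get_result."""
    # Each entry is (url, retries used so far), so no separate counter is needed.
    pending = deque((url, 0) for url in url_list)
    results, failed = [], []

    while pending:
        url, retries = pending.popleft()
        try:
            result = fetch(url)
        except Exception as exc:
            # Treat an exception the same way as an error result.
            result = {'error': str(exc)}

        if 'error' not in result:
            results.append(result)
        elif retries < max_retries:
            # Back of the line, with one more retry used up.
            pending.append((url, retries + 1))
        else:
            # Retries exhausted: record the URL and move on.
            failed.append(url)

    return results, failed
```

Calling `scrape_all(url_list, get_result)` would try each URL at most `max_retries + 1` times, so the loop has to terminate, and there is no 5-second wait on an empty queue at the end. The `failed` list also shows which URLs never succeeded.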

Putting an item back on a queue with a maximum retry limit in Python
