Question

I am trying to do the following.

I have 8 cores.

I run 8 processes as follows, where core_aa is the name of the file whose URLs are loaded into the queue:

python threaded_crawl.py core_aa --max_async_count=20 --use_headers --verbose > /tmp/core_aa.out
python threaded_crawl.py core_ab --max_async_count=20 --use_headers --verbose > /tmp/core_ab.out
python threaded_crawl.py core_ac --max_async_count=20 --use_headers --verbose > /tmp/core_ac.out
python threaded_crawl.py core_ad --max_async_count=20 --use_headers --verbose > /tmp/core_ad.out
python threaded_crawl.py core_ae --max_async_count=20 --use_headers --verbose > /tmp/core_ae.out
python threaded_crawl.py core_af --max_async_count=20 --use_headers --verbose > /tmp/core_af.out
python threaded_crawl.py core_ag --max_async_count=20 --use_headers --verbose > /tmp/core_ag.out
python threaded_crawl.py core_ah --max_async_count=20 --use_headers --verbose > /tmp/core_ah.out

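Each process is a threaded application that runs 20 threads whose job is to fetch URLs. If I have, say, 60K URLs and run a single process to do the work, all the threads stay alive until the queue is empty.
If I run multiple processes, I notice that the threads slowly start to die, for example one dies every 1000 URLs. The idea is to split the 60K URLs of one process into 8 parts, for a total of 20 * 8 threads.
The processes do not share any data.

So, given that the job works with a single process, why does running multiple processes kill the threads?

How can I fix this?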

class ThreadClass(threading.Thread):
    def __init__(self,parms={},proxy_list=[],user_agent_list=[],use_cookies=True,fn=None,verbose=False):
        threading.Thread.__init__(self)

    def run(self):
        while page_queue.qsize()>0:
            FETCH URLS....


for page in xrange(THREAD_LIMIT):
    tc = ThreadClass(parms=parms,proxy_list=proxy_list,user_agent_list=user_agent_list,use_cookies=use_cookies,fn=fn,verbose=verbose)
    tc.start()
    while threading.activeCount()>=THREAD_LIMIT:
        time.sleep(1)
while threading.activeCount()>1:
    time.sleep(1)

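I know how to debug, and there are no errors. Given that I have the following condition,

while threading.activeCount()>1:
    time.sleep(1)

the code continues as soon as all the threads have died, even though items remain in the queue and the threads are supposed to run until the queue is empty.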

Once the active count

Answer 1

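.qsize() returns an approximate size. Don't use page_queue.qsize() > 0 to check whether the queue is empty: between the check and the next .get(), another thread may already have taken the last item, and a thread that sees a momentarily empty queue exits early. Instead, you could use while True: .. page_queue.get() .. with a sentinel to know when to finish (example), or the queue.task_done(), queue.join() combination.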
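The sentinel approach could look roughly like the sketch below. It is written for Python 3 (where the module is named queue rather than Queue), and fetch, SENTINEL, worker and crawl are hypothetical names introduced only for illustration; the real download logic from the question is not shown there, so fetch is a stand-in.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker meaning "no more work"


def fetch(url):
    # Stand-in for the real download; assumed for illustration only.
    return len(url)


def worker(page_queue, results):
    while True:
        url = page_queue.get()        # blocks until an item is available
        if url is SENTINEL:
            break                     # the only way a worker exits
        results.append(fetch(url))    # list.append is thread-safe in CPython


def crawl(urls, thread_count=20):
    page_queue = queue.Queue()
    results = []
    # Create exactly thread_count workers; no activeCount() polling needed.
    threads = [
        threading.Thread(target=worker, args=(page_queue, results))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for url in urls:
        page_queue.put(url)
    for _ in threads:
        page_queue.put(SENTINEL)      # one sentinel per worker
    for t in threads:
        t.join()                      # returns once every worker saw a sentinel
    return results
```

Because each worker blocks in get() rather than polling qsize(), a temporarily empty queue no longer causes a thread to exit.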

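Catch exceptions in the .run() method to avoid killing the thread prematurely. An uncaught exception in one fetch ends that thread for good, which would look exactly like threads "slowly dying".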

Don't use .activeCount() if you need n threads; just create n threads.

Make your threads daemons so that you can interrupt your program at any time.
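As a rough sketch of how the last three points might fit together with the task_done() / join() combination (Python 3; Fetcher, fetch and crawl are hypothetical names, and fetch only simulates a download that sometimes fails):

```python
import queue
import threading
import traceback


def fetch(url):
    # Stand-in for the real download; fails for URLs starting with "bad".
    if url.startswith("bad"):
        raise ValueError("cannot fetch %s" % url)
    return url.upper()


class Fetcher(threading.Thread):
    def __init__(self, page_queue, results, errors):
        # daemon=True: the thread will not keep the interpreter alive on exit.
        super().__init__(daemon=True)
        self.page_queue = page_queue
        self.results = results
        self.errors = errors

    def run(self):
        while True:
            url = self.page_queue.get()
            try:
                self.results.append(fetch(url))
            except Exception:
                # Record the failure and keep the thread alive.
                self.errors.append((url, traceback.format_exc()))
            finally:
                self.page_queue.task_done()  # always mark the item as handled


def crawl(urls, thread_count=20):
    page_queue = queue.Queue()
    results, errors = [], []
    for _ in range(thread_count):           # create exactly n threads
        Fetcher(page_queue, results, errors).start()
    for url in urls:
        page_queue.put(url)
    page_queue.join()                        # waits until every item is task_done
    return results, errors
```

Here the main thread waits on the queue rather than on the thread count, so a failed fetch shows up in errors instead of silently reducing the number of workers.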

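If your program is IO-bound, you don't need multiple processes. Otherwise, you could use the multiprocessing module to manage multiple processes instead of starting them manually.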

Python and threading - threads slowly die when running multiple processes
