Question

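I'm currently running some tests with threading / workerpool. I create 400 threads that download a total of 5000 URLs... The problem is that some of the 400 threads "freeze": when I look at my processes, I see roughly 15 threads freeze on every run, and after a while they eventually close one by one.

My question is whether there is a way to have some kind of 'timer'/'counter' that kills a thread if it hasn't finished after x seconds.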

# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url # The url we'll need to download when the job runs
    def run(self):
        try:
            url = urllib2.urlopen(self.url).read()
        except:
            pass

# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)

# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()

Answer 1

The problem is that some of the 400 threads are "freezing"...

That is most likely because of this line...

url = urllib2.urlopen(self.url).read()

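By default, Python will wait forever for a remote server to respond, so if one of your URLs points to a server that ignores SYN packets, or is simply very slow, the thread can stay blocked forever.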
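A quick way to see this default for yourself is to ask a fresh socket what timeout it starts with. This is only a small sketch, and `new_socket_timeout` is a hypothetical helper name. Unless something has already called `socket.setdefaulttimeout()`, both values printed are `None`, which means blocking calls have no time limit.

```python
import socket

def new_socket_timeout():
    """Return the timeout a freshly created socket starts with."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # None means blocking mode: connect/recv wait with no time limit.
        return sock.gettimeout()
    finally:
        sock.close()

print(new_socket_timeout())
print(socket.getdefaulttimeout())
```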

You can use the timeout parameter of urlopen() to limit how long the thread waits for the remote host to respond...

url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds

...or you can set it globally with socket.setdefaulttimeout() by putting these lines at the top of your code...

import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
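Whichever option you pick, a timeout surfaces as an exception inside the worker, so the job needs to handle it. Below is a rough sketch of what that could look like, assuming Python 3, where `urllib2.urlopen` lives in `urllib.request`. The `fetch` helper name and the 5-second default are my own choices for illustration, not something from the question.

```python
import urllib.request

def fetch(url, timeout=5):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError and socket timeouts are both subclasses of OSError.
        print("giving up on %s: %s" % (url, exc))
        return None
```

In the question's `DownloadJob.run()`, the same idea would be to pass `timeout=` to `urllib2.urlopen` and log the exception instead of silently passing, so you can see which URLs were slow.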

Answer 2

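urlopen accepts a timeout value, which I think is the best way to handle this.

But I agree with the commenter that 400 threads is probably too many.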
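For what it's worth, here is a minimal sketch of combining a per-request timeout with a much smaller pool. It assumes Python 3 and uses the standard library's `concurrent.futures` rather than workerpool. The names `fetch` and `download_all` and the figure of 20 workers are placeholders I picked for the example, so tune the number for your own bandwidth and the servers involved.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url, timeout=5):
    """Return the number of bytes downloaded, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return len(response.read())
    except (OSError, ValueError):
        # OSError covers URLError and socket timeouts;
        # ValueError covers malformed URLs.
        return None

def download_all(urls, worker=fetch, max_workers=20):
    """Run worker over urls with a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))
```

Usage would be something like `download_all(line.strip() for line in open("urls.txt"))`. Because every request has a timeout, a slow server costs one worker at most a few seconds instead of hanging it indefinitely.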
