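I want to use subprocess to run 20 instances of a script I have written in parallel. Suppose I have a large list of URLs with about 100,000 entries, and my program should make sure that 20 instances of my script are working on that list at all times. I would like to code it as follows: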
    urllist = [url1, url2, url3, .. , url100000]
    i=0
    while number_of_subprocesses < 20 and i<100000:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i+1
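My script just writes something into a database or a text file. It doesn't output anything and doesn't need any input other than the URL.
My problem is that I couldn't find out how to get the number of active subprocesses. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how to make the while loop check the condition again once the 20 subprocesses are loaded. I thought of maybe putting another while loop around it, something like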
    while i<100000:
        while number_of_subprocesses < 20:
            subprocess.Popen(['python', 'script.py', urllist[i]])
            i = i+1
        if number_of_subprocesses == 20:
            sleep() # wait some time before checking again
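Or maybe there is a better way for the while loop to keep checking the number of subprocesses?
I also considered using the multiprocessing module, but I found it very convenient to just call script.py with subprocess instead of calling a function with multiprocessing.
Maybe someone can help me and point me in the right direction. Thanks a lot!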
Answer 0 (score: 6)
Taking a different approach from the callback-based answer, since it seems that a callback can't be passed as a parameter:
    import subprocess
    import time

    # urllist is assumed to be defined already (the list of URLs from the question)
    NextURLNo = 0
    MaxProcesses = 20
    MaxUrls = 100000 # Note this would be better to be len(urllist)
    Processes = []

    def StartNew():
        """ Start a new subprocess if there is work to do """
        global NextURLNo
        global Processes
        if NextURLNo < MaxUrls:
            proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
            print("Started to Process %s" % urllist[NextURLNo])
            NextURLNo += 1
            Processes.append(proc)

    def CheckRunning():
        """ Check any running processes and start new ones if there are spare slots."""
        global Processes
        global NextURLNo
        for p in range(len(Processes) - 1, -1, -1): # Check the processes in reverse order
            if Processes[p].poll() is not None: # poll() returns None while the process is still running
                del Processes[p] # Remove from list - this is why we needed reverse order
        while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls): # More to do and some spare slots
            StartNew()

    if __name__ == "__main__":
        CheckRunning() # This will start the max processes running
        while (len(Processes) > 0): # Something still going on.
            time.sleep(0.1) # You may wish to change the time for this
            CheckRunning()
        print("Done!")
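The same polling idea can also be packaged as a function without module-level globals. The following is only a sketch under stated assumptions: `run_limited` and its parameters are hypothetical names that do not appear in the answer, and the demo uses a child process that exits immediately as a stand-in for `['python', 'script.py', url]`.

```python
import subprocess
import sys
import time


def run_limited(commands, max_procs=20, poll_interval=0.1):
    """Run each command as a subprocess, keeping at most max_procs alive.

    Returns the exit codes in the order the children were seen to finish.
    """
    pending = iter(commands)
    running = []
    exit_codes = []
    exhausted = False
    while True:
        # Sort children into finished and still running;
        # poll() returns None while a child is still running.
        still_running = []
        for proc in running:
            code = proc.poll()
            if code is None:
                still_running.append(proc)
            else:
                exit_codes.append(code)
        running = still_running

        # Fill any spare slots with new children.
        while not exhausted and len(running) < max_procs:
            try:
                cmd = next(pending)
            except StopIteration:
                exhausted = True
                break
            running.append(subprocess.Popen(cmd))

        if exhausted and not running:
            return exit_codes
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for ['python', 'script.py', url]: a child that just exits.
    demo = [[sys.executable, "-c", "pass"] for _ in range(5)]
    print(run_limited(demo, max_procs=2))
```

For the question's case, `commands` could be built as `[['python', 'script.py', url] for url in urllist]`. Because it is consumed through an iterator, a generator works as well.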
Answer 1 (score: 1)
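Just keep a count as you start them, and use a callback from each subprocess to start a new one if there are still URL list entries to process.
For example, assuming that your subprocess calls the OnExit method passed to it when it ends: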
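    import subprocess
    import time

    # urllist is assumed to be defined already (the list of URLs from the question)
    NextURLNo = 0
    MaxProcesses = 20
    NoSubProcess = 0
    MaxUrls = 100000

    def StartNew():
        """ Start a new subprocess if there is work to do """
        global NextURLNo
        global NoSubProcess
        if NextURLNo < MaxUrls:
            # Illustrative only: Popen arguments must be strings, so a Python
            # function such as OnExit cannot really be handed to the child like this.
            subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
            print("Started to Process", urllist[NextURLNo])
            NextURLNo += 1
            NoSubProcess += 1

    def OnExit():
        global NoSubProcess # needed to modify the module-level counter
        NoSubProcess -= 1

    if __name__ == "__main__":
        for n in range(MaxProcesses):
            StartNew()
        while (NoSubProcess > 0):
            time.sleep(1)
            if (NextURLNo < MaxUrls):
                for n in range(NoSubProcess,MaxProcesses):
                    StartNew()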
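Since `subprocess.Popen` has no on-exit callback parameter, one way to approximate the callback idea is a watcher thread per child that waits for it and then calls the callback in the parent. This is a sketch under stated assumptions rather than the answer's own method: `popen_with_callback` and `run_all` are hypothetical helper names, and a semaphore replaces the manual counter.

```python
import subprocess
import sys
import threading


def popen_with_callback(cmd, on_exit):
    """Start cmd; call on_exit(returncode) from a watcher thread when it ends."""
    def watch():
        try:
            code = subprocess.Popen(cmd).wait()
        except OSError:
            code = None  # the child could not be started at all
        on_exit(code)

    thread = threading.Thread(target=watch)
    thread.start()
    return thread


def run_all(commands, max_procs=20):
    """Run all commands, never more than max_procs at once; return exit codes."""
    slots = threading.Semaphore(max_procs)
    lock = threading.Lock()
    exit_codes = []

    def on_exit(code):
        with lock:
            exit_codes.append(code)
        slots.release()  # free a slot so the loop below can start another child

    threads = []
    for cmd in commands:
        slots.acquire()  # blocks while max_procs children are running
        threads.append(popen_with_callback(cmd, on_exit))
    for thread in threads:
        thread.join()
    return exit_codes


if __name__ == "__main__":
    # Stand-in for ['python', 'script.py', url]: a child that just exits.
    demo = [[sys.executable, "-c", "pass"] for _ in range(5)]
    print(run_all(demo, max_procs=2))
```

The callback runs in the parent process on the watcher thread, so any shared state it touches needs a lock, as `exit_codes` does here.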
Answer 2 (score: 1)
To keep a constant number of concurrent requests, you could use a thread pool:
    #!/usr/bin/env python
    from multiprocessing.dummy import Pool

    def process_url(url):
        pass # ... handle a single url

    urllist = [url1, url2, url3, .. , url100000]
    for _ in Pool(20).imap_unordered(process_url, urllist):
        pass
To run processes instead of threads, remove .dummy from the import.
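If script.py should stay a separate script, as the question prefers, the thread pool can be combined with subprocess: each worker thread launches one child and blocks until it exits, so at most 20 children run at a time. This is a minimal sketch, assuming script.py sits in the current directory and takes the URL as its only argument. `run_script` and `process_all` are hypothetical names.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_script(url):
    """Run one child process for url and return (url, exit code)."""
    # Assumption: script.py is in the current directory and reads the URL from argv[1].
    return url, subprocess.call([sys.executable, "script.py", url])


def process_all(urls, worker=run_script, concurrency=20):
    """Apply worker to every URL with at most `concurrency` running at once."""
    with ThreadPool(concurrency) as pool:
        return list(pool.imap_unordered(worker, urls))
```

Threads are sufficient here because each one spends its time waiting on a child process, and the real work happens in the children. `ThreadPool` is the class that `multiprocessing.dummy.Pool` creates.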