Bear with me - this is my first multi-threading/multiprocessing Python project.
I am working on a Python script that should run n instances of some.exe, each of which takes an ID as an argument. The IDs are pulled from a local SQLite database and deleted once they have been processed successfully. No ID should ever be processed by more than one some.exe at a time (hence the WORK boolean).
I know that pool.map below expects some kind of iterable, but this is my first project involving any form of multi-threading/processing, and I don't know how to handle that.
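For reference, pool.map takes a function plus an iterable and hands each element to one worker. A minimal sketch of that calling convention (work and the ID list are placeholder names, not from the script below):

```python
from multiprocessing.pool import ThreadPool

def work(task_id):
    # placeholder for launching some.exe with task_id
    return task_id * 2

ids = [1, 2, 3, 4]
with ThreadPool(2) as pool:
    # map blocks until every element of ids has been processed
    results = pool.map(work, ids)
print(results)  # [2, 4, 6, 8]
```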
The script should keep running n instances of some.exe until no IDs are left. some.exe may take 1-6 minutes per ID.
If relevant, this will run on a Windows machine.
The code below is only pseudocode, with all non-essential parts omitted:
#!/usr/bin/python
import time, sqlite3, subprocess
from datetime import datetime
from multiprocessing.pool import ThreadPool as Pool

def run_worker(lite_cur):
    lite_cur.execute("SELECT ID FROM IDS WHERE WORK != 1")
    found_id = lite_cur.fetchone()[0]
    lite_cur.execute("UPDATE IDS SET WORK = 1 WHERE ID = ?", (found_id,))
    #starting a subprocess in a pool is probably not what one should do.. help?
    process = subprocess.Popen(["some.exe", found_id])
    process.wait()
    #how would one check if some.exe crashed or completed successfully?
    if process.returncode == 0:
        lite_cur.execute("DELETE FROM IDS WHERE ID = ?", (found_id,))
    else:
        #do this if some.exe crashed or reported errors.
        lite_cur.execute("UPDATE IDS SET WORK = 0 WHERE ID = ?", (found_id,))

def run_checker(lite_cur, ids_left):
    time.sleep(600)
    lite_cur.execute("SELECT * FROM IDS")
    #may exceed 1 million, is there a better/faster way?
    if len(lite_cur.fetchall()) == 0:
        ids_left = False

def main():
    start_time = datetime.now()
    #lite_db_name will be implemented as an argument.
    lite_db_name = "some.db"
    lite_con = sqlite3.connect(lite_db_name)
    lite_cur = lite_con.cursor()
    #IDs should be self-explanatory and WORK is used as a boolean to define
    #if a worker is already working on this ID
    lite_cur.execute("CREATE TABLE IF NOT EXISTS IDS(ID TEXT, WORK INTEGER DEFAULT 0)")
    #max_worker will be implemented as an argument
    max_worker = 4
    worker_pool = Pool(max_worker)
    #a pool with the limit of 1 is probably dumb as duck..
    checker_pool = Pool(1)
    lite_cur.execute("SELECT * FROM IDS")
    ids_left = len(lite_cur.fetchall()) > 0
    while ids_left:
        worker_pool.map(run_worker(lite_cur))
        checker_pool.map(run_checker(lite_cur, ids_left))
    end_time = datetime.now()
    print("Congratulations - All IDs processed.")
    print("It took: {}".format(end_time - start_time))

if __name__ == "__main__":
    main()
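On the inline question about detecting a crash: after wait(), Popen.returncode holds the process's exit status, and by convention 0 means success while any non-zero value means failure. A small self-contained demo, using the Python interpreter itself as a stand-in for some.exe:

```python
import subprocess, sys

# Stand-in for some.exe: a one-liner that exits cleanly (code 0).
# With a real executable you would pass ["some.exe", found_id] instead.
proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(0)"])
proc.wait()                 # blocks until the process ends
print(proc.returncode)      # 0 -> completed without errors

# A failing stand-in for comparison: exits with code 3.
proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
proc.wait()
print(proc.returncode)      # 3 -> crashed or reported errors
```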
I would greatly appreciate any suggestions and comments.
EDIT: Sorry for not posting a clear question. The purpose of this question is to get some general advice for any further development.
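On the side question about checking whether IDs remain faster than fetchall() over a million rows: SQLite can count on its side, so Python never materializes the rows. A sketch with a throwaway in-memory table matching the schema above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IDS(ID TEXT, WORK INTEGER DEFAULT 0)")
con.executemany("INSERT INTO IDS(ID) VALUES (?)", [("a",), ("b",), ("c",)])

# COUNT(*) returns a single row with a single number,
# instead of shipping every row into a Python list.
(remaining,) = con.execute("SELECT COUNT(*) FROM IDS").fetchone()
print(remaining)  # 3
```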
Answer 0: (score: 1)
Example code (functional, tested):
import sqlite3 as sql
from this import s as nonsense
import subprocess
import shlex
import time

max_parallel_processes = 10

def getdb(tableid="test"):
    dbid = ":memory:"
    stmt_create = "CREATE TABLE %s (id int, comment text)" % tableid
    stmt_insert = "INSERT INTO %s VALUES (?, ?)" % tableid
    values = enumerate(nonsense.split())
    db = sql.connect(dbid)
    db.execute(stmt_create)
    db.executemany(stmt_insert, values)
    return db

def get_ids(db, tableid="test"):
    stmt_select_id = "SELECT id FROM %s" % tableid
    crs = db.execute(stmt_select_id)
    for row in crs.fetchall():
        yield row

def main():
    from random import randint
    db = getdb()
    process_lst = {}
    sleep_between_polls_in_seconds = 0.1
    for rowid in get_ids(db):
        # wait until a slot is free before launching the next process
        while len(process_lst) >= max_parallel_processes:
            print("max processes (%s) reached" % max_parallel_processes)
            for proc in list(process_lst):  # copy: we delete while iterating
                if proc.poll() is not None:
                    print("%s finished" % process_lst[proc])
                    del process_lst[proc]
            time.sleep(sleep_between_polls_in_seconds)
        cmd_str = "sleep %s" % randint(1, 3)
        cmd = shlex.split(cmd_str)
        print("adding : %s (%s)" % (rowid, cmd_str))
        proc = subprocess.Popen(cmd)
        process_lst[proc] = rowid
    # wait for whatever is still running
    for proc in list(process_lst):
        proc.wait()
        del process_lst[proc]
    print("All processes processed: %s" % (len(process_lst) == 0))

if __name__ == "__main__":
    main()
In my example I do not inspect the output (stderr, stdout) of the called subprocesses, which your code apparently needs to do, but that is easy to achieve via the Popen constructor.
Also, redirecting stdout/stderr may allow you to replace the time.sleep construct with select to pause the polling loop (at least under *nix).
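A rough *nix-only sketch of that idea: with stdout redirected to a pipe, select() blocks until a child's pipe is readable (which includes end-of-file when the child exits), so no sleep-based busy-waiting is needed. The Python interpreter stands in for the real child processes here:

```python
import select, subprocess, sys

# Three children that do nothing but sleep briefly and exit.
procs = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"],
                     stdout=subprocess.PIPE)
    for _ in range(3)
]
pipes = {p.stdout: p for p in procs}
while pipes:
    ready, _, _ = select.select(list(pipes), [], [])  # blocks, no polling
    for pipe in ready:
        if pipe.read() == b"":          # EOF: the child has exited
            pipes.pop(pipe).wait()
print("all children finished")
```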
This way you get your parallelization while avoiding the dreaded threads. Note that subprocess.Popen already spawns a process of its own; wrapping that subprocess in an additional thread gains you no functionality and adds considerable overhead. You still need to build some queue-like structure.
Hope that helps.
EDIT:
proc.poll() is needed for two reasons:
Adding stdout=subprocess.PIPE to the Popen call allows reading the process's standard output in my example via Popen().stdout:
with proc.stdout as f:
    program_output = f.read()
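Reading proc.stdout like that is fine once the process has exited; while it is still running, Popen.communicate() is the safer way to drain the pipe (it reads everything and then waits for the process, avoiding the classic full-pipe deadlock). A small demo, again with the Python interpreter as the child:

```python
import subprocess, sys

proc = subprocess.Popen(
    [sys.executable, "-c", "print('some.exe completed without errors!')"],
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()   # reads all output, then waits for exit
print(out.decode().strip())   # some.exe completed without errors!
print(proc.returncode)        # 0
```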