Keep multiple processes/threads running and accessing a sqlite3 db until a condition is met

Date: 2014-10-01 08:35:40

Tags: python multithreading process sqlite multiprocessing

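Bear with me - this is my first multi-threading/processing Python project.

I am working on a Python script that should run n instances of some.exe, each of which takes an ID as an argument. The IDs are pulled from a local sqlite database and are deleted once they have been processed successfully. No ID should be processed by more than one some.exe at a time (hence the WORK boolean).

I know that the pool.map call below needs some kind of iterable, but this is my first project involving any form of multithreading/processing, and I do not know how to handle it.

The script should run until no IDs are left, constantly keeping n instances of some.exe running. some.exe may need 1-6 minutes per ID.

If it is relevant, this will run on a Windows machine.

The code below is only pseudocode, and all non-essential parts have been left out: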

#!/usr/bin/python

import time, sqlite3, subprocess
from datetime import datetime
from multiprocessing.pool import ThreadPool as Pool   

def run_worker(lite_cur):

    lite_cur.execute("SELECT ID FROM IDS WHERE WORK != 1")
    found_id = lite_cur.fetchone()

    lite_cur.execute("UPDATE IDS SET WORK = 1 WHERE ID = \'"+found_id+"\'")

    #starting a subprocess in a pool is probably not what one should do.. help?
    process = subprocess.Popen(["some.exe", found_id])
    process.wait()

    #how would one check if some.exe crashed or completed successfully?
    if process = "some.exe completed without errors!":
        lite_cur.execute("DELETE FROM IDS WHERE ID = \'"+found_id+"\'")
    else:
        #do this if some.exe crashed or reported errors.
        lite_cur.execute("UPDATE IDS SET WORK = 0 WHERE ID = \'"+found_id+"\'")


def run_checker(lite_cur, ids_left):
    time.sleep(600)
    lite_cur.execute("SELECT * FROM IDS")
    #may exceed 1 million, is there a better/faster way?
    if len(lite_cur.fetchall()) == 0:
        ids_left = False

def main():

    #lite_db_name will be implemented as an argument.
    lite_db_name = "some.db"
    lite_con = sqlite3.connect(lite_db_name)
    lite_cur = lite_con.cursor()

    #IDs should be self-explanatory and WORK is used as a boolean to define if a worker is already working on this ID 
    lite_cur.execute("CREATE TABLE IF NOT EXISTS IDS(ID TEXT, WORK INTEGER DEFAULT 0)")

    #max_worker will be implemented as an argument
    max_worker = 4
    worker_pool = Pool(max_worker)
    #a pool with the limit of 1 is probably dumb as duck.. 
    checker_pool = Pool(1)

    lite_cur.execute("SELECT * FROM IDS")
    if len(lite_cur.fetchall()) > 0:
        ids_left = True
    else:
        ids_left = False

    while ids_left:
        worker_pool.map(run_worker(lite_cur))
        checker_pool.map(run_checker(lite_cur, ids_left))

    end_time = datetime.now()
    print ("Congratulations - All IDs processed.")
    print ("It took: {}".format(end_time - start_time))

if  __name__ == "__main__":
    main()

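I would really appreciate any suggestions and comments.

Edit: Sorry for not posting a clear question. The aim of this question is to get some general advice for any further development.

1 Answer:

Answer 0: (score: 1)

Example code (working, tested):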

import sqlite3 as sql
from this import s as nonsense
import subprocess
import shlex
import time

max_parallel_processes = 10

def getdb(tableid = "test"):
    dbid = ":memory:"    

    stmt_create = "CREATE TABLE %s (id int, comment text)" % tableid
    stmt_insert = "INSERT INTO %s VALUES (?, ?)" % tableid

    values = enumerate(nonsense.split())

    db = sql.connect(dbid)
    db.execute(stmt_create)
    db.executemany(stmt_insert, values)
    return db


def get_ids(db, tableid = "test"):
    stmt_select_id = "SELECT id from %s " % tableid
    crs = db.execute(stmt_select_id)
    result = crs.fetchall()
    for i in result:
        yield i


def main():
    from random import randint

    db = getdb()
    process_lst = {}
    sleep_between_polls_in_seconds = 0.1

    for rowid in get_ids(db):
        # Wait for a free slot first, so that no ID is skipped while the pool is full.
        while len(process_lst) >= max_parallel_processes:
            # Iterate over a copy, because finished entries are deleted inside the loop.
            for proc in list(process_lst.keys()):
                finished = proc.poll() is not None
                if finished:
                    print("%s finished" % process_lst[proc])
                    del process_lst[proc]
            time.sleep(sleep_between_polls_in_seconds)

        cmd_str = "sleep %s" % randint(1, 3)
        cmd = shlex.split(cmd_str)

        print("adding : %s (%s)" % (rowid, cmd_str))

        proc = subprocess.Popen(cmd)
        process_lst[proc] = rowid

    # Wait for the processes that are still running after the last ID was started.
    for proc in list(process_lst.keys()):
        proc.wait()
        print("%s finished" % process_lst[proc])
        del process_lst[proc]

    print("All processes processed: %s " % (len(process_lst) == 0))



if __name__ == "__main__":
    main()

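In my example I do not inspect the output (stderr, stdout) of the called subprocess, which your code apparently has to do, but that is easy to achieve through the Popen constructor. In addition, redirecting stdout/stderr may allow you to replace the polling loop and its time.sleep pauses with a select construct (at least in *nix environments).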
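A minimal sketch of how the exit code and the output of a child process could be checked, which is what the question asks about for some.exe. The helper name run_and_capture is hypothetical, the Python interpreter stands in for some.exe, and treating exit code 0 as success is an assumption about how some.exe reports errors.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd, wait for it, and return (exit_code, stdout_text, stderr_text)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode both pipes as text
    )
    # communicate() reads both pipes until the child exits, then reaps it.
    out, err = proc.communicate()
    return proc.returncode, out, err


# Stand-in children: the Python interpreter replaces some.exe here (assumption).
ok = run_and_capture([sys.executable, "-c", "print('done')"])
failed = run_and_capture([sys.executable, "-c", "import sys; sys.exit(3)"])

for code, out, err in (ok, failed):
    status = "success" if code == 0 else "failure"
    print("%s (exit code %s)" % (status, code))
```

communicate() blocks until the child exits, so in the polling loop it would only be called once poll() has reported that the process is finished.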

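This way you get parallelization while avoiding the dreaded threading. Note that subprocess.Popen already starts its own operating-system process, whereas a threading.Thread only runs inside the current one. Wrapping a subprocess in a thread gains no functionality and adds considerable overhead. You would still have to build a queue-like structure.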
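The queue-like structure could look roughly like the sketch below. It assumes the IDS table with the ID and WORK columns from the question, a single parent process that owns the database connection, and exit code 0 meaning success. The names process_all and make_cmd are hypothetical, and the Python interpreter again stands in for some.exe.

```python
import sqlite3
import subprocess
import sys
import time
from collections import deque


def process_all(con, make_cmd, max_parallel=4, poll_seconds=0.05):
    """Run one child per unclaimed ID in IDS, at most max_parallel at a time."""
    pending = deque(row[0] for row in con.execute("SELECT ID FROM IDS WHERE WORK = 0"))
    running = {}  # Popen object -> ID it is working on
    while pending or running:
        # Fill free slots from the queue and mark each ID as being worked on.
        while pending and len(running) < max_parallel:
            id_ = pending.popleft()
            con.execute("UPDATE IDS SET WORK = 1 WHERE ID = ?", (id_,))
            running[subprocess.Popen(make_cmd(id_))] = id_
        # Collect finished children without blocking on the running ones.
        for proc in list(running):
            code = proc.poll()
            if code is None:
                continue  # still running
            id_ = running.pop(proc)
            if code == 0:
                con.execute("DELETE FROM IDS WHERE ID = ?", (id_,))
            else:
                # Failed IDs are released again but not retried in this run.
                con.execute("UPDATE IDS SET WORK = 0 WHERE ID = ?", (id_,))
        con.commit()
        time.sleep(poll_seconds)


# Demo with an in-memory database; the child fails only for the ID "bad".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IDS(ID TEXT, WORK INTEGER DEFAULT 0)")
con.executemany("INSERT INTO IDS(ID) VALUES (?)", [("a",), ("b",), ("bad",)])


def make_cmd(id_):
    script = "import sys; sys.exit(1 if sys.argv[1] == 'bad' else 0)"
    return [sys.executable, "-c", script, id_]


process_all(con, make_cmd, max_parallel=2)
remaining = con.execute("SELECT ID, WORK FROM IDS").fetchall()
print(remaining)
```

Because only the parent touches the database, the WORK flag is never contended here. If several independent scripts shared the same file, claiming an ID would need its own transaction.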

Hope that helps.

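Edit: Popen and proc.poll() are needed for two reasons:

  1. Popen starts the external process
  2. proc.poll() checks its status without blocking, so it keeps running in the background

Adding stdout=subprocess.PIPE to Popen allows the standard output to be read as a file, Popen().stdout, as in my example: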

    with proc.stdout as f:
        program_output = f.read()
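For completeness, a self-contained version of that fragment might look like the following sketch. The child command is a stand-in (a second Python interpreter that prints two lines), not the real program.

```python
import subprocess
import sys

# Stand-in for the external program: a child Python process that prints two lines.
cmd = [sys.executable, "-c", "print('line 1'); print('line 2')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
with proc.stdout as f:
    program_output = f.read()  # blocks until the child closes its stdout
exit_code = proc.wait()        # reap the child and fetch its exit code

lines = program_output.splitlines()
print(lines, exit_code)
```

Reading the pipe before calling wait() matters: a child that writes more than the pipe buffer can hold would otherwise block forever while the parent waits for it to exit.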