Read queries to an SQLite DB hang when using Python 3 multiprocessing

Time: 2014-08-20 18:17:21

Tags: python-3.x sqlite multiprocessing

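I have a test list of 10,000 IDs, and this is what I am trying to do:

  1. For each test ID, calculate a rank by comparing it with other IDs (i.e. people from the same company).

  2. Check whether the rank of this test ID is higher than 'normal': a) calculate ranks for 1,000 randomly selected IDs (same as step 1); b) compare these 1,000 ranks with the rank of the test ID.

  3. Do this (steps 1 and 2) for all 10,000 test IDs, using data from 10 different months.

  4. To store the master data of 14,000 IDs with observations for 10 months, I use SQLite, because it makes querying and ranking easier and faster.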
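
To make steps 1 and 2 concrete, here is a minimal sketch and not the actual script. The table `track`, its columns `id`, `company` and `hours`, the toy rows, and the helper names are all assumptions made up for illustration, as is drawing the random IDs with replacement.

```python
import random
import sqlite3


def build_demo_db():
    """Create an in-memory table with hypothetical columns: id, company, hours."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE track (id TEXT PRIMARY KEY, company TEXT, hours REAL)")
    rows = [
        ("a1", "ACME", 40.0), ("a2", "ACME", 35.0), ("a3", "ACME", 20.0),
        ("b1", "BETA", 50.0), ("b2", "BETA", 10.0),
    ]
    conn.executemany("INSERT INTO track VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def relative_rank(conn, id_):
    """Share of same-company IDs whose hours are <= those of id_ (1.0 is the top)."""
    company, hours = conn.execute(
        "SELECT company, hours FROM track WHERE id = ?", (id_,)
    ).fetchone()
    # In SQLite a comparison evaluates to 0 or 1, so SUM counts the matches.
    total, at_or_below = conn.execute(
        "SELECT COUNT(*), SUM(hours <= ?) FROM track WHERE company = ?",
        (hours, company),
    ).fetchone()
    return at_or_below / total


def fraction_random_at_least(conn, test_id, all_ids, n_random, rng):
    """Fraction of randomly drawn IDs whose rank is >= the test ID's rank."""
    test_rank = relative_rank(conn, test_id)
    # Assumption: random IDs are drawn with replacement.
    random_ids = [rng.choice(all_ids) for _ in range(n_random)]
    random_ranks = [relative_rank(conn, rid) for rid in random_ids]
    return sum(r >= test_rank for r in random_ranks) / n_random


if __name__ == "__main__":
    conn = build_demo_db()
    ids = [row[0] for row in conn.execute("SELECT id FROM track")]
    print(fraction_random_at_least(conn, "a2", ids, 1000, random.Random(0)))
    conn.close()
```

The queries use `?` placeholders for values, which avoids quoting problems that string formatting can introduce.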

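    To reduce the run time, I am using 'multiprocessing' and parallelizing the computation by month, i.e. ranks for different months are computed on different cores. This works with a smaller number of test IDs (<= 2000) or fewer random ranks (>= 200), but if I compute ranks for all 10 months in parallel and use 1,000 random ranks per ID, the script freezes after a few hours. No error is reported. I believe SQLite is the culprit and need your help to resolve the problem.

    Here is my code: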

    import sqlite3
    from multiprocessing import Pool

    nproc = 10 ## Number of cores
    randNum = 1000 ## Number of random ranks for each ID
    
    def main():
        '''
        This will go through every specified column one by one, and for each entry
        a rank is computed and compared with the ranks of 1000 randomly selected entries from the same column
        '''
        ## Read master file with 14000 rows X 20 cols; each row pertains to one ID,
        ## first 9 columns have info related to the ID and the last 10 have observed values from 10 different months
        resList = List with 14000 entries Eg. [(123,"ABC",.....),(234,"DEF",........)....14000n]
    
        ## Read test file, for which ranks are to be calculated. Contains 10,000 IDs and their names
        global testIDList ## for p-value calculation
        testIDList = List with 10,000 entries Eg. [(123,"ABC"),(234,"DEF")..10,000n]
    
        ## Create identifier SET - Used for random selection of IDs
        global idSET ## Used in rankCalcTest
        idSET = SET OF ALL IDs FROM MASTER FILE
    
        global trackTableName,coordsDB,chrLimit  ## Globals for all other functions
    
        ## Specify column numbers in master file that have values for each ID from different months
        trackList = [10,11,12,13,14,15,16,17,18,19,20] ## Columns in file with 14000 rows each. 
        ### Parallel
        allTrackPvals = PPResults(rankCalcTest,trackList)
    
        DO SOME PROCESSING
        SCRIPT ENDS
    
    def rankCalcTest(col):
        '''
        Calculates ranks for test IDs using column/month specified by 'main()' function 
        '''
        DB = '%s_%s.db' % (coordsDB.split('.')[0],col) ## New DB for every column/month, because this function is parallelized and every core works on one column/month
        conn = sqlite3.connect(DB)
    
        trackPvals = [] ## Temporary list that will hold ranks for single column/month
        tableCols = [col] ## Column with observed values from a month, used to generate column-specific ranks
    
        ## Make sqlite3 table for current track
        trackTableName = 'track_%s' % (col) ## Table containing all IDs and observations from the specific column
        trackTableName = tableMaker(trackTableName,annoDict,resList,tableCols,conn) ## This module is not included in the example, as it works well; uses SQLite
        chrLimit = chrLimits(trackTableName,conn) ## This module is not included in the example, as it works well; uses SQLite
    
        for ent in testIDList: ## Total 10,000 entries
    
            ## Generate relative rank for the ID of interest
            mainID = ent[0] ## ID number
            mainRank = rankGenerator(mainID,trackTableName,chrLimit,conn) ## See below for function
    
            randomIDs = randomSelect(idSET,randNum)
            randomRanks = []
            for randID in randomIDs:
                randomRank = rankGenerator(randID,trackTableName,chrLimit,conn)
                randomRanks.append(randomRank)
    
            ### Some calculation
            probRR = DO SOME CALCULATION
            trackPvals.append(round(probRR,5))
    
        conn.close()
        return trackPvals
    
    def rankGenerator(ID,trackTableName,chrLimit,conn):
        '''
        Generate a rank for each ID provided by 'rankCalcTest' function
        '''
        print ('\nRank is being calculated for ID:%s' % (ID))
    
        IDCoord = aDict[ID] ## Get required info to construct the read query
        company = IDCoord[0]
        intervalIDs = [] ## List to hold all the IDs in an interval
        rank = 0 ##Initialize
    
        cur = conn.cursor()
    
        print ('ID class 0')
        cur.execute("SELECT ID,hours FROM %s WHERE chr = '%s' AND start between %s and %s ORDER BY hours desc" % (trackTableName,company))
        intIDs = cur.fetchall()
        intervalIDs.extend(intIDs) ## There is one more query in certain cases, removed for brevity
    
        Rank = SOME CALCULATION
        print('Relative Rank for %s: %s'% (ID,str(Rank)))
        return Rank
    
    
    def PPResults(module,alist):
        npool = Pool(int(nproc))    
        res = npool.map_async(module, alist)
        results = (res.get())
        return results
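
The `PPResults` pattern above can be sketched in a self-contained form. This is only an illustration under assumptions: `rank_one_column`, `run_columns`, `db_path`, the `track` table and its toy data are hypothetical stand-ins for `rankCalcTest` and the real tables. Each worker opens its own database file, and `get(timeout=...)` raises `multiprocessing.TimeoutError` rather than blocking indefinitely, which may help tell a genuine hang from a very slow run.

```python
import os
import sqlite3
import tempfile
from multiprocessing import Pool, TimeoutError

DB_DIR = tempfile.gettempdir()  # hypothetical location for the per-column files


def db_path(col):
    """One database file per column/month, so workers never share a file."""
    return os.path.join(DB_DIR, "track_demo_%s.db" % col)


def rank_one_column(col):
    """Worker: build and query a database that only this process touches."""
    path = db_path(col)
    if os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, hours REAL)")
        # Toy data standing in for the observations of one month.
        conn.executemany(
            "INSERT INTO track VALUES (?, ?)",
            [(i, float(i * col % 7)) for i in range(100)],
        )
        conn.commit()
        (n_top,) = conn.execute(
            "SELECT COUNT(*) FROM track WHERE hours >= ?", (5.0,)
        ).fetchone()
        return col, n_top
    finally:
        conn.close()
        os.remove(path)


def run_columns(cols, nproc, timeout_s):
    """Map columns over a pool; raise TimeoutError instead of waiting forever."""
    with Pool(nproc) as pool:
        return pool.map_async(rank_one_column, cols).get(timeout=timeout_s)


if __name__ == "__main__":
    try:
        print(run_columns([10, 11, 12], nproc=3, timeout_s=60))
    except TimeoutError:
        print("workers did not finish within the timeout")
```

Leaving the `with Pool(...)` block terminates the workers, so a timeout also cleans up the stuck processes.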
    

    The script appears to freeze in the 'rankGenerator' function:

    Rank is being calculated for ID:1423187_at
    Rank is being calculated for ID:1452528_a_at
    
    Coordinates found for:1423187_at - 8,111940709,111952915
    Coordinates found for:1452528_a_at - 19,43612500,43614912
    ID class 0
    

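    Because execution is parallel, it is hard to tell on which line the script freezes, but 'rankGenerator' looks like the freezing point. Is it related to locking in SQLite?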
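
One way to probe the locking question is to see what a blocked read looks like from Python. The sketch below is an illustration only: the function name, the `track` table and the temporary file are assumptions, and it deliberately puts two connections on the same file, whereas the code above creates a separate DB per column. `sqlite3.connect` accepts a `timeout` argument (5 seconds by default), and once that time has passed a blocked statement raises `sqlite3.OperationalError` instead of waiting forever.

```python
import os
import sqlite3
import tempfile


def read_while_locked(timeout_s):
    """Try to read a database file while another connection holds an exclusive lock."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    # isolation_level=None puts the writer in autocommit mode, so the
    # explicit BEGIN EXCLUSIVE below is the only open transaction.
    writer = sqlite3.connect(path, isolation_level=None)
    reader = sqlite3.connect(path, timeout=timeout_s)
    try:
        writer.execute("CREATE TABLE track (id INTEGER, hours REAL)")
        writer.execute("INSERT INTO track VALUES (1, 2.5)")
        writer.execute("BEGIN EXCLUSIVE")  # hold the lock, as a stuck writer would
        try:
            reader.execute("SELECT COUNT(*) FROM track").fetchone()
            return "read succeeded"
        except sqlite3.OperationalError as exc:
            return str(exc)
    finally:
        reader.close()
        writer.close()  # closing discards the open transaction
        os.remove(path)


if __name__ == "__main__":
    print(read_while_locked(0.5))
```

If the real script never reports such an error, lock contention seems a less likely cause than, for example, one worker being far slower than the rest, though that is a guess and not a diagnosis.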

    Sorry for the long code. It is actually a heavily trimmed version that took me 3 hours to prepare. I would appreciate any help.

    AK

0 answers:

No answers