Question

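I have a test list of 10,000 IDs, and this is what needs to be done:

1. For every test ID, compute a rank by comparing it with other IDs (i.e. people from the same company).
2. Check whether the rank of this test ID is higher than 'normal':
   a) compute the ranks of 1000 randomly selected IDs (as in step 1);
   b) compare these 1000 ranks with the rank of the test ID.
3. Do this (steps 1 and 2) for all 10,000 test IDs, using data from 10 different months.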
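
To make the comparison in step 2 concrete, here is a minimal sketch. It assumes that a numerically larger value means a higher rank and that "higher than normal" is judged by the fraction of random ranks that reach the test rank; the real calculation is not shown in this question. The names `empirical_p_value` and `sample_random_ids` are hypothetical.

```python
import random

def empirical_p_value(test_rank, random_ranks):
    """Fraction of random ranks that are at least as high as the test rank.

    Assumption: a larger number means a higher rank.
    """
    if not random_ranks:
        raise ValueError("need at least one random rank")
    at_least_as_high = sum(1 for r in random_ranks if r >= test_rank)
    return at_least_as_high / len(random_ranks)

def sample_random_ids(all_ids, k, rng=random):
    """Pick k distinct IDs at random (hypothetical stand-in for randomSelect)."""
    # random.sample needs a sequence, so the set is turned into a sorted list.
    return rng.sample(sorted(all_ids), k)
```

A small fraction would suggest that the test ID ranks unusually high compared with randomly chosen IDs.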

To store the master data (14,000 IDs with observations from 10 months) I use SQLite, because it makes querying and ranking easier and faster.
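
As an illustration of the kind of ranking query meant here, the sketch below ranks one ID against the other rows of the same company. The column names `company` and `hours`, the rule that rank 1 is the highest `hours` value, and the helper name `company_rank` are assumptions for the example, not the actual schema or calculation.

```python
import sqlite3

def company_rank(conn, table, target_id):
    """Return (rank, group_size) of target_id among rows sharing its company.

    Rank 1 is the highest 'hours' value; tied rows share the better rank.
    """
    # Table names cannot be bound as parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    row = conn.execute(
        "SELECT company, hours FROM %s WHERE ID = ?" % table, (target_id,)
    ).fetchone()
    if row is None:
        raise KeyError(target_id)
    company, hours = row
    # Count the rows of the same company with strictly more hours.
    higher, total = conn.execute(
        "SELECT COALESCE(SUM(hours > ?), 0), COUNT(*) FROM %s WHERE company = ?" % table,
        (hours, company),
    ).fetchone()
    return higher + 1, total
```

Values are passed with `?` placeholders, so only the table name goes through string formatting.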

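To reduce the run time I am using 'multiprocessing' and parallelizing the computation by month, i.e. ranks for different months are calculated on different cores. This works for a smaller number of test IDs (<= 2000) or fewer random ranks (<= 200), but if I compute ranks for all 10 months in parallel and use 1000 random ranks per ID, the script freezes after a few hours. No error is reported. I believe SQLite is the culprit and need your help to resolve the issue.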

Here is my code:

import sqlite3
from multiprocessing import Pool

nproc = 10 ## Number of cores
randNum = 1000 ## Number of random ranks for each ID

def main():
    '''
    This will go through every specified column one by one, and for each entry
    a rank of the entry will be computed, which is compared with the ranks of 1000 randomly selected entries from the same column
    '''
    ## Read master file with 14000 rows X 20 cols, each row pertains to an ID,
    ## first 9 columns have info related to ID and last 10 have observed values from 10 diff. months
    resList = List with 14000 entries Eg. [(123,"ABC",.....),(234,"DEF",........)....14000n]

    ## Read test file, for which ranks are to be calculated. Contains 10,000 IDs and their names
    global testIDList ## for p-value calculation
    testIDList = List with 10,000 entries Eg. [(123,"ABC"),(234,"DEF")..10,000n]

    ## Create identifier SET - Used for random selection of IDs
    global idSET ## Used in rankCalcTest
    idSET = SET OF ALL IDs FROM MASTER FILE

    global trackTableName,coordsDB,chrLimit  ## Globals for all other functions

    ## Specify column numbers in master file that have values for each ID from different months
    trackList = [10,11,12,13,14,15,16,17,18,19,20] ## Columns in file with 14000 rows each. 
    ### Parallel
    allTrackPvals = PPResults(rankCalcTest,trackList)

    DO SOME PROCESSING
    SCRIPT ENDS

def rankCalcTest(col):
    '''
    Calculates ranks for test IDs using column/month specified by 'main()' function 
    '''
    DB = '%s_%s.db' % (coordsDB.split('.')[0],col) ## New DB for every column/month, because this function is parallelized and every core works on one column/month
    conn = sqlite3.connect(DB)

    trackPvals = [] ## Temporary list that will hold ranks for single column/month
    tableCols = [col] ## Column with observed values from one month, used to generate column-specific ranks

    ## Make sqlite3 table for current track
    trackTableName = 'track_%s' % (col) ## A table is created here containing all IDs and observations from the specific column
    trackTableName = tableMaker(trackTableName,annoDict,resList,tableCols,conn) ## This module is not included in the example, as it works well; it uses SQLite
    chrLimit = chrLimits(trackTableName,conn) ## This module is not included in the example, as it works well; it uses SQLite

    for ent in testIDList: ## Total 10,000 entries

        ## Generate relative rank for the ID of interest
        mainID = ent[0] ## ID number
        mainRank = rankGenerator(mainID,trackTableName,chrLimit,conn) ## See below for function

        randomIDs = randomSelect(idSET,randNum)
        randomRanks = []
        for randID in randomIDs:
            randomRank = rankGenerator(randID,trackTableName,chrLimit,conn)
            randomRanks.append(randomRank)

        ### Some calculation
        probRR = DO SOME CALCULATION
        trackPvals.append(round(probRR,5))

    conn.close()
    return trackPvals

def rankGenerator(ID,trackTableName,chrLimit,conn):
    '''
    Generate a rank for each ID provided by 'rankCalcTest' function
    '''
    print ('\nRank is being calculated for ID:%s' % (ID))

    IDCoord = aDict[ID] ## Get required info to construct the read query
    company = IDCoord[0]
    intervalIDs = [] ## List to hold all the IDs in an interval
    rank = 0 ##Initialize

    cur = conn.cursor()

    print ('ID class 0')
    ## NOTE: the query has four placeholders but only two values are left in this trimmed example
    cur.execute("SELECT ID,hours FROM %s WHERE chr = '%s' AND start between %s and %s ORDER BY hours desc" % (trackTableName,company))
    intIDs = cur.fetchall()
    intervalIDs.extend(intIDs) ## There is one more query in certain cases, removed for brevity of code

    Rank = SOME CALCULATION
    print('Relative Rank for %s: %s'% (ID,str(Rank)))
    return Rank


def PPResults(module,alist):
    npool = Pool(int(nproc))    
    res = npool.map_async(module, alist)
    results = (res.get())
    return results
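
For comparison, here is a hedged sketch of the same fan-out with three changes: inputs reach the workers through a Pool initializer instead of globals assigned in `main()` (which workers only inherit under the fork start method), every connection is closed in a `finally` block, and the parent can bound its wait with `get(timeout=...)`. The helper names, the table name `track` and the file name template are hypothetical. Note that `sqlite3.connect` waits `timeout` seconds (5 by default) for a lock and then raises `sqlite3.OperationalError`, so a lock would normally surface as an error.

```python
import sqlite3
from multiprocessing import Pool

_worker_state = {}  # per-process storage, filled by init_worker (hypothetical)

def init_worker(test_ids, db_template):
    """Runs once in each worker process and stores the read-only inputs."""
    _worker_state["test_ids"] = test_ids
    _worker_state["db_template"] = db_template

def rank_month(col):
    """Open this month's own database, query it, and always close it."""
    db_path = _worker_state["db_template"] % col
    conn = sqlite3.connect(db_path, timeout=30)  # wait up to 30 s on a lock
    try:
        found = 0
        for test_id in _worker_state["test_ids"]:
            # Placeholder work: count the test IDs present in this month's table.
            row = conn.execute(
                "SELECT hours FROM track WHERE ID = ?", (test_id,)
            ).fetchone()
            if row is not None:
                found += 1
        return col, found
    finally:
        conn.close()

def run_all_months(cols, test_ids, db_template, nproc=10, max_wait=None):
    """Map rank_month over the months; max_wait=None waits indefinitely."""
    with Pool(nproc, initializer=init_worker,
              initargs=(test_ids, db_template)) as pool:
        async_result = pool.map_async(rank_month, cols)
        # Raises multiprocessing.TimeoutError if max_wait seconds pass first.
        return async_result.get(timeout=max_wait)
```

A call such as `run_all_months(trackList, testIDList, 'coords_%s.db', max_wait=6 * 3600)` from under an `if __name__ == "__main__":` guard would then fail with a timeout error rather than wait forever.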

The script freezes in the 'rankGenerator' function:

Rank is being calculated for ID:1423187_at
Rank is being calculated for ID:1452528_a_at

Coordinates found for:1423187_at - 8,111940709,111952915
Coordinates found for:1452528_a_at - 19,43612500,43614912
ID class 0

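Because of the parallel execution it is hard to say at which line the script freezes, but it looks like 'rankGenerator' is the freezing point. Is it related to locks in SQLite?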
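
One way to narrow down where each process stops is to give every month its own timestamped log file instead of interleaved `print` output. The sketch below is only an illustration using the standard `logging` module; the helper name `make_worker_logger` and the log file names are hypothetical.

```python
import logging
import os

def make_worker_logger(col, log_dir="."):
    """Return a logger that writes to its own file for the given column/month."""
    logger = logging.getLogger("rank_worker_%s" % col)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = logging.FileHandler(os.path.join(log_dir, "track_%s.log" % col))
        handler.setFormatter(
            logging.Formatter("%(asctime)s pid=%(process)d %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `log = make_worker_logger(col)` at the top of `rankCalcTest` and `log.info('before query for %s', ID)` and `log.info('after query for %s', ID)` around the query would leave, for each month, a last line showing which step never finished.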

Sorry for the large amount of code. It is actually a heavily trimmed version that took me 3 hours to prepare. I would appreciate any help.

AK

Read query to SQLite DB hangs when using Python 3 multiprocessing

0 answers: