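I have a test list of 10,000 IDs, and this is what I am trying to do:
1. For each test ID, calculate its rank by comparing it with other IDs (i.e. people from the same company).
2. Check whether the rank of this test ID is higher than 'normal': a) calculate the ranks of 1000 randomly selected IDs (same as step 1); b) compare these 1000 ranks with the rank of the test ID.
3. Do this (steps 1 and 2) for all 10,000 test IDs using data from 10 different months.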
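The steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the names `rank_within_group` and `fraction_ranked_at_least_as_high` are hypothetical, the data is made up, the random IDs are drawn from a single group for simplicity, and the final comparison (a simple fraction) stands in for the calculation that is not shown in the question.

```python
import random

def rank_within_group(values, target_id):
    """Return the 1-based rank of target_id in its group (highest value = rank 1)."""
    target = values[target_id]
    # Rank = 1 + number of group members with a strictly larger value.
    return 1 + sum(1 for v in values.values() if v > target)

def fraction_ranked_at_least_as_high(test_rank, random_ranks):
    """Fraction of random ranks that are as good as or better than test_rank."""
    hits = sum(1 for r in random_ranks if r <= test_rank)
    return hits / len(random_ranks)

# Hypothetical observations for one company in one month: ID -> hours.
hours = {"a": 40.0, "b": 35.5, "c": 52.0, "d": 12.0}

# Step 1: rank of the test ID within its group.
test_rank = rank_within_group(hours, "a")

# Step 2: ranks of 1000 randomly selected IDs, compared with the test rank.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
ids = sorted(hours)
random_ranks = [rank_within_group(hours, rng.choice(ids)) for _ in range(1000)]
prob = fraction_ranked_at_least_as_high(test_rank, random_ranks)
```

Step 3 would repeat this for every test ID and every month.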
To store the master data (14,000 IDs with observations for 10 months) I am using SQLite, because it makes querying and ranking easier and faster.
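As a rough illustration of ranking with SQLite, the sketch below builds a tiny in-memory table and computes a rank with a counting query. The table layout (`ID`, `company`, `hours`), the index name and the function `rank_in_company` are assumptions made for this example; they are not the schema used in the script.

```python
import sqlite3

# Hypothetical rows: (ID, company, hours) for a single month.
rows = [
    ("id1", "X", 40.0),
    ("id2", "X", 52.0),
    ("id3", "X", 12.0),
    ("id4", "Y", 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (ID TEXT PRIMARY KEY, company TEXT, hours REAL)")
conn.executemany("INSERT INTO track VALUES (?, ?, ?)", rows)
# An index on the filter column keeps repeated rank lookups cheap.
conn.execute("CREATE INDEX idx_track_company ON track (company)")
conn.commit()

def rank_in_company(conn, target_id):
    """Return the 1-based rank of target_id within its company, or None if unknown."""
    row = conn.execute(
        "SELECT company, hours FROM track WHERE ID = ?", (target_id,)
    ).fetchone()
    if row is None:
        return None
    company, hours = row
    # Count colleagues with strictly more hours; parameters are bound with '?'.
    (higher,) = conn.execute(
        "SELECT COUNT(*) FROM track WHERE company = ? AND hours > ?",
        (company, hours),
    ).fetchone()
    return higher + 1
```

Binding values with `?` placeholders instead of `%` string formatting lets SQLite handle quoting; table names cannot be bound this way and still have to be formatted into the statement.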
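To reduce the run time, I am using 'multiprocessing' and parallelizing the calculation by month, i.e. the ranks for different months are computed on different cores. This works well for a smaller number of test IDs (<= 2000) or fewer random ranks (<= 200), but if I compute the ranks for all 10 months in parallel and use 1000 as the number of random ranks for each ID, the script freezes after a few hours. No error is reported. I believe SQLite is the culprit and need your help to fix the problem.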
Here is my code:
import sqlite3
from multiprocessing import Pool

nproc = 10 ## Number of cores
randNum = 1000 ## Number of random ranks for each ID

def main():
    '''
    This will go through every specified column one by one, and for each entry
    a rank of the entry will be computed, which is compared with the ranks of 1000 randomly selected entries from the same column
    '''
    ## Read master file with 14000 rows X 20 cols, each row pertains to an ID,
    ## first 9 columns have info related to the ID and last 10 have observed values from 10 diff. months
    resList = List with 14000 entries Eg. [(123,"ABC",.....),(234,"DEF",........)....14000n]

    ## Read test file, for which ranks are to be calculated. Contains 10,000 IDs and their names
    global testIDList ## for p-value calculation
    testIDList = List with 10,000 entries Eg. [(123,"ABC"),(234,"DEF")..10,000n]

    ## Create identifier SET - Used for random selection of IDs
    global idSET ## Used in rankCalcTest
    idSET = SET OF ALL IDs FROM MASTER FILE

    global trackTableName,coordsDB,chrLimit ## Globals for all other functions

    ## Specify column numbers in master file that have values for each ID from different months
    trackList = [10,11,12,13,14,15,16,17,18,19,20] ## Columns in file with 14000 rows each.

    ### Parallel
    allTrackPvals = PPResults(rankCalcTest,trackList)

    DO SOME PROCESSING
    SCRIPT ENDS

def rankCalcTest(col):
    '''
    Calculates ranks for test IDs using the column/month specified by the 'main()' function
    '''
    DB = '%s_%s.db' % (coordsDB.split('.')[0],col) ## New DB for every column/month - because this function is parallelized, every core works on its own column/month
    conn = sqlite3.connect(DB)
    trackPvals = [] ## Temporary list that will hold ranks for a single column/month
    tableCols = [col] ## Column with observed values from a month, that will be used to generate column-specific ranks

    ## Make sqlite3 table for current track
    trackTableName = 'track_%s' % (col) ## Here a table is created containing all IDs and observations from the specific column
    trackTableName = tableMaker(trackTableName,annoDict,resList,tableCols,conn) ## This module is not included in the example, as it works well - uses SQLite
    chrLimit = chrLimits(trackTableName,conn) ## This module is not included in the example, as it works well - uses SQLite

    for ent in testIDList: ## Total 10,000 entries
        ## Generate relative rank for the ID of interest
        mainID = ent[0] ## ID number
        mainRank = rankGenerator(mainID,trackTableName,chrLimit,conn) ## See below for function

        randomIDs = randomSelect(idSET,randNum)
        randomRanks = []
        for randID in randomIDs:
            randomRank = rankGenerator(randID,trackTableName,chrLimit,conn)
            randomRanks.append(randomRank)

        ### Some calculation
        probRR = DO SOME CALCULATION
        trackPvals.append(round(probRR,5))

    conn.close()
    return trackPvals

def rankGenerator(ID,trackTableName,chrLimit,conn):
    '''
    Generate a rank for each ID provided by the 'rankCalcTest' function
    '''
    print ('\nRank is being calculated for ID:%s' % (ID))
    IDCoord = aDict[ID] ## Get required info to construct the read query
    company = IDCoord[0]
    intervalIDs = [] ## List to hold all the IDs in an interval
    rank = 0 ## Initialize

    cur = conn.cursor()
    print ('ID class 0')
    cur.execute("SELECT ID,hours FROM %s WHERE chr = '%s' AND start between %s and %s ORDER BY hours desc" % (trackTableName,company))
    intIDs = cur.fetchall()
    intervalIDs.extend(intIDs) ## There is one more query in certain cases, removed for brevity of code

    Rank = SOME CALCULATION
    print('Relative Rank for %s: %s'% (ID,str(Rank)))
    return Rank

def PPResults(module,alist):
    npool = Pool(int(nproc))
    res = npool.map_async(module, alist)
    results = (res.get())
    return results

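For comparison, here is a minimal sketch of the one-task-per-month pattern in which every worker opens and closes its own SQLite connection and the parent waits with a timeout. With a timeout, a stuck worker surfaces as `multiprocessing.TimeoutError` rather than an indefinite wait. The names `sum_hours_for_month` and `run_months`, the in-memory database and the toy data are assumptions for illustration only.

```python
import sqlite3
from multiprocessing import Pool

def sum_hours_for_month(month):
    """Worker: use a private connection, do the month's work, return a small result."""
    conn = sqlite3.connect(":memory:")  # stand-in for a per-month database file
    try:
        conn.execute("CREATE TABLE track (ID TEXT, hours REAL)")
        conn.executemany(
            "INSERT INTO track VALUES (?, ?)",
            [("id%d" % i, float(i * month)) for i in range(5)],
        )
        (total,) = conn.execute("SELECT SUM(hours) FROM track").fetchone()
        return month, total
    finally:
        conn.close()  # always release the connection, even if a query fails

def run_months(months, nproc=2, timeout=60):
    """Run one task per month; raise multiprocessing.TimeoutError if not done in time."""
    with Pool(nproc) as pool:
        res = pool.map_async(sum_hours_for_month, months)
        return res.get(timeout=timeout)
```

`run_months(range(1, 11))` would normally be called under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter.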
The script freezes in the 'rankGenerator' function:
Rank is being calculated for ID:1423187_at
Rank is being calculated for ID:1452528_a_at
Coordinates found for:1423187_at - 8,111940709,111952915
Coordinates found for:1452528_a_at - 19,43612500,43614912
ID class 0
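Because of the parallel execution it is hard to say at which line the script freezes, but 'rankGenerator' looks like the freezing point. Is it related to locking in SQLite?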
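One possible way to find the exact line where a worker is stuck is the standard-library `faulthandler` module, which can periodically dump the tracebacks of all threads in a process. The sketch below writes one log file per process so that the output of parallel workers does not interleave. The helper names `watch_for_hangs` and `stop_watching` and the log file naming are assumptions; in the script they would be called at the start and end of each worker function.

```python
import faulthandler
import os
import tempfile

def watch_for_hangs(log_dir, interval_seconds=600):
    """Dump this process's thread tracebacks to its own log file at a fixed interval."""
    path = os.path.join(log_dir, "hang_%d.log" % os.getpid())
    log_file = open(path, "w")  # must stay open while dumps are scheduled
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=log_file)
    return log_file

def stop_watching(log_file):
    """Cancel the scheduled dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()

# Usage: wrap the long-running work of one worker.
log = watch_for_hangs(tempfile.gettempdir(), interval_seconds=600)
# ... long-running ranking work would go here ...
stop_watching(log)
```

If a worker hangs, the last traceback in its log file shows the line it is blocked on, for example a particular `cur.execute` call.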
Sorry for the long code. It is actually a heavily trimmed version that took me 3 hours to prepare. I would appreciate any help.
AK