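I have a dictionary D of {string: list} entries, and I compute a function f(D[s1], D[s2]) -> float for a pair of strings (s1, s2) in D.
In addition, I have created a custom matrix class, LabeledNumericMatrix, that lets me perform assignments such as m[ID1, ID2] = 1.0.
I need to compute f(x, y) and store the result in m[x, y] for all 2-tuples in a set of strings S, including the case s1 = s2. This is easy to code as a loop, but the code takes quite a long time to run once the size of S grows to large values such as 10,000 or more.
None of the results I store in the labeled matrix m depend on each other. It therefore seems straightforward to parallelize this computation with Python's multithreading or multiprocessing facilities. However, since CPython does not really allow me to run the computation of f(x, y) and the storage of m[x, y] simultaneously through threading, multiprocessing seems to be my only option. On the other hand, I don't think multiprocessing is designed to pass data structures of roughly 1 GB between processes, such as my labeled matrix structure containing 10000x10000 elements.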
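For reference, the serial baseline described above could look roughly like the sketch below (Python 3). The scoring function `f` is a hypothetical dot product, and a plain dict keyed by `(row, col)` tuples stands in for `LabeledNumericMatrix`, whose real interface is not shown in the question.

```python
from itertools import product

def f(xs, ys):
    """Hypothetical stand-in for the real scoring function: a dot product."""
    return float(sum(x * y for x, y in zip(xs, ys)))

def serial_scores(D, S):
    """Compute f(D[x], D[y]) for every ordered pair in S, including x == y."""
    m = {}  # stand-in for LabeledNumericMatrix (assumption)
    for x, y in product(S, repeat=2):
        m[x, y] = f(D[x], D[y])
    return m

# Toy data, made up for illustration.
D = {"a": [1, 2], "b": [3, 4], "c": [0, 1]}
m = serial_scores(D, list(D))
```

With |S| = 10,000 this loop makes 10^8 calls to `f`, which is why the run time becomes a concern.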
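Answer 0 (score: 6)
Option 1: share state through a Manager. From the multiprocessing documentation:
"A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies."
"A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array."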
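A minimal sketch of the Manager approach (Python 3), assuming a toy job of storing list lengths; `record_length` and the sample data are hypothetical names made up for illustration. Note that every access through the proxy is a round trip to the manager's server process, so this suits modest amounts of shared data better than a very large matrix.

```python
import multiprocessing

def record_length(shared, key, values):
    """Worker: store len(values) under key through the manager proxy."""
    shared[key] = len(values)

def main():
    with multiprocessing.Manager() as manager:
        shared = manager.dict()  # proxy to a dict held by the manager's server process
        jobs = {"a": [1, 2, 3], "b": [4, 5]}  # toy data
        workers = [
            multiprocessing.Process(target=record_length, args=(shared, k, v))
            for k, v in jobs.items()
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return dict(shared)  # copy the contents out before the manager shuts down

if __name__ == "__main__":
    print(main())
```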
Option 2: create a pool of workers, an input queue, and a result queue.
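One way this could be wired up, sketched in Python 3 with plain `multiprocessing.Process` workers and two `multiprocessing.Queue` objects. The scoring function `f`, the `None` sentinel convention, and the helper names are assumptions for illustration.

```python
import multiprocessing

def f(xs, ys):
    """Hypothetical scoring function: a dot product."""
    return float(sum(x * y for x, y in zip(xs, ys)))

def worker(data, tasks, results):
    """Pull (s1, s2) pairs until a None sentinel arrives; push ((s1, s2), score)."""
    for s1, s2 in iter(tasks.get, None):
        results.put(((s1, s2), f(data[s1], data[s2])))

def score_all(data, pairs, n_workers=2):
    pairs = list(pairs)
    tasks, results = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(data, tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for pair in pairs:
        tasks.put(pair)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker
    # Collect exactly one result per task before joining the workers.
    scores = dict(results.get() for _ in pairs)
    for p in procs:
        p.join()
    return scores

if __name__ == "__main__":
    toy = {"a": [1, 2], "b": [3, 4]}
    print(score_all(toy, [("a", "a"), ("a", "b"), ("b", "b")]))
```

Only the small (pair, score) tuples cross process boundaries here; the large matrix stays in the parent process.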
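Option 3: split the work into parallel problems. Your data is independent: f(D[s_i], D[s_j]) is a self-contained problem, independent of any f(D[s_k], D[s_l]). Furthermore, the computation time for each pair should be fairly equal, or at least of the same order of magnitude.
Divide the task into n input sets, where n is the number of computing units (cores, or even computers) you have. Give each input set to a different process, and join the output.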
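The splitting step can be sketched as follows (Python 3), assuming a toy module-level `DATA` dict and a hypothetical dot-product `f`. Pairs are dealt round-robin into n lists, each list goes to one worker through `multiprocessing.Pool.map`, and the partial outputs are merged in the parent.

```python
import multiprocessing
from itertools import combinations_with_replacement

DATA = {"a": [1, 2], "b": [3, 4], "c": [0, 1]}  # toy data (assumption)

def f(xs, ys):
    """Hypothetical scoring function: a dot product."""
    return float(sum(x * y for x, y in zip(xs, ys)))

def split(items, n):
    """Deal items round-robin into n lists of near-equal size."""
    return [items[i::n] for i in range(n)]

def score_chunk(chunk):
    """Score every pair in one input set."""
    return [((s1, s2), f(DATA[s1], DATA[s2])) for s1, s2 in chunk]

def main(n=None):
    n = n or multiprocessing.cpu_count()
    pairs = list(combinations_with_replacement(sorted(DATA), 2))
    with multiprocessing.Pool(n) as pool:
        parts = pool.map(score_chunk, split(pairs, n))
    # Join the outputs of all workers into one mapping.
    return dict(item for part in parts for item in part)

if __name__ == "__main__":
    print(main(2))
```

Round-robin dealing keeps the sets balanced when every pair costs about the same, which is the assumption stated above.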
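Answer 1 (score: 2)
You definitely won't get any performance gain from threading; it is an inappropriate tool for CPU-bound tasks.
Answer 2 (score: 1)
Answer 3 (score: 0)
I have created a solution (rather than "the solution") to the question posed, and since others may find it useful, I am posting the code here. My solution is a hybrid of options 1 and 3 suggested by Adam Matan. The code includes the line numbers from my vi session, which will help in the discussion below.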
12   # System libraries needed by this module.
13   import numpy, multiprocessing, time
15   # Third-party libraries needed by this module.
16   import labeledMatrix
18   # ----- Begin code for this module. -----
19   from commonFunctions import debugMessage
21   def createSimilarityMatrix( fvFileHandle, fvFileParser, fvSimScorer, colIDs, rowIDs=None,
22                               exceptionType=ValueError, useNumType=numpy.float, verbose=False,
23                               maxProcesses=None, processCheckTime=1.0 ):
24       """Create a labeled similarity matrix from vectorial data in [fvFileHandle] that can be
25       parsed by [fvFileParser].
26       [fvSimScorer] should be a function that can return a floating point value for a pair of vectors.
28       If the matrix [rowIDs] are not specified, they will be the same as the [colIDs].
30       [exceptionType] will be raised when a row or column ID cannot be found in the vectorial data.
31       [maxProcesses] specifies the number of CPUs to use for calculation; default value is all available CPUs.
32       [processCheckTime] is the interval for checking activity of CPUs (if completed calculation or not).
34       Return: a LabeledNumericMatrix with corresponding row and column IDs."""
36       # Setup row/col ID information.
37       useColIDs = list( colIDs )
38       useRowIDs = rowIDs or useColIDs
39       featureData = fvFileParser( fvFileHandle, retainIDs=(useColIDs+useRowIDs) )
40       verbose and debugMessage( "Retrieved %i feature vectors from FV file." % len(featureData) )
41       featureIDs = featureData.keys()
42       absentIDs = [ ID for ID in set(useColIDs + useRowIDs) if ID not in featureIDs ]
43       if absentIDs:
44           raise exceptionType( "IDs %s not found in feature vector file." % absentIDs )
45       # Otherwise, proceed to creation of matrix.
46       resultMatrix = labeledMatrix.LabeledNumericMatrix( useRowIDs, useColIDs, numType=useNumType )
47       calculateSymmetric = set( useRowIDs ) == set( useColIDs )
49       # Setup data structures needed for parallelization.
50       numSubprocesses = multiprocessing.cpu_count() if maxProcesses is None else int(maxProcesses)
51       assert numSubprocesses >= 1, "Specification of %i CPUs to calculate similarity matrix." % numSubprocesses
52       dataManager = multiprocessing.Manager()
53       sharedFeatureData = dataManager.dict( featureData )
54       resultQueue = multiprocessing.Queue()
55       # Assign jobs evenly through number of processors available.
56       jobList = [ list() for i in range(numSubprocesses) ]
57       calculationNumber = 0  # Will hold total number of results stored.
58       if calculateSymmetric:  # Perform calculations with n(n+1)/2 pairs, instead of n^2 pairs.
59           remainingIDs = list( useRowIDs )
60           while remainingIDs:
61               firstID = remainingIDs[0]
62               for secondID in remainingIDs:
63                   jobList[ calculationNumber % numSubprocesses ].append( (firstID, secondID) )
64                   calculationNumber += 1
65               remainingIDs.remove( firstID )
66       else:  # Straight processing one at a time.
67           for rowID in useRowIDs:
68               for colID in useColIDs:
69                   jobList[ calculationNumber % numSubprocesses ].append( (rowID, colID) )
70                   calculationNumber += 1
72       verbose and debugMessage( "Completed setup of job distribution: %s." % [len(js) for js in jobList] )
73       # Define a function to perform calculation and store results
74       def runJobs( scoreFunc, pairs, featureData, resultQueue ):
75           for pair in pairs:
76               score = scoreFunc( featureData[pair[0]], featureData[pair[1]] )
77               resultQueue.put( ( pair, score ) )
78           verbose and debugMessage( "%s: completed all calculations." % multiprocessing.current_process().name )
81       # Create processes to perform parallelized computing.
82       processes = list()
83       for num in range(numSubprocesses):
84           processes.append( multiprocessing.Process( target=runJobs,
85               args=( fvSimScorer, jobList[num], sharedFeatureData, resultQueue ) ) )
86       # Launch processes and wait for them to all complete.
87       import Queue  # For Queue.Empty exception.
88       for p in processes:
89           p.start()
90       assignmentsCompleted = 0
91       while assignmentsCompleted < calculationNumber:
92           numActive = [ p.is_alive() for p in processes ].count( True )
93           verbose and debugMessage( "%i/%i complete; Active processes: %i" % \
94               ( assignmentsCompleted, calculationNumber, numActive ) )
95           while True:  # Empty queue immediately to avoid underlying pipe/socket implementation from hanging.
96               try:
97                   pair, score = resultQueue.get( block=False )
98                   resultMatrix[ pair[0], pair[1] ] = score
99                   assignmentsCompleted += 1
100                  if calculateSymmetric:
101                      resultMatrix[ pair[1], pair[0] ] = score
102              except Queue.Empty:
103                  break
104          if numActive == 0: finished = True
105          else:
106              time.sleep( processCheckTime )
107      # Result queue emptied and no active processes remaining - completed calculations.
108      return resultMatrix
109  ## end of createSimilarityMatrix()
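Lines 36-47 are simply preliminaries related to the problem definition that was part of the original question. The multiprocessing setup for getting around CPython's GIL is in lines 49-56, and lines 57-70 create the subdivided tasks evenly. The code in lines 57-70 is used in preference to itertools.product because, once the list of row/column IDs reaches 40,000 or so, the product ends up taking an enormous amount of memory.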
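As an alternative sketch (Python 3, and not what the posted code does), the symmetric pairs can be generated lazily and cut into fixed-size chunks, so that the full pair list never has to exist in memory at once. `itertools.combinations_with_replacement` yields the same n(n+1)/2 pairs, in the same order, as the loop in lines 59-65. The helper names `symmetric_pairs` and `chunked` are made up for illustration.

```python
from itertools import combinations_with_replacement, islice

def symmetric_pairs(ids):
    """Lazily yield the n(n+1)/2 pairs (a, b) where a appears at or before b in ids."""
    return combinations_with_replacement(ids, 2)

def chunked(iterable, size):
    """Yield successive lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

ids = ["a", "b", "c"]
chunks = list(chunked(symmetric_pairs(ids), 4))
```

Each chunk could then be handed to a worker as one job; only the ID list itself and the chunks currently in flight need to be held in memory.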
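In my first attempt (not shown here), the "try ... resultQueue.get() and assign ... except ..." code was actually located outside the outer control loop (the one that runs while not all calculations are finished). When I ran that version of the code on a unit test with a 9x9 matrix, there were no problems. However, on moving up to 200x200 or larger, I found that the code would hang, despite nothing in the code having changed between executions.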
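A plausible explanation, stated here as an assumption because that first version is not shown, is the behavior the multiprocessing documentation describes under "Joining processes that use queues": a process that has put items on a queue will not terminate until its buffered items have been flushed to the underlying pipe, so waiting for the workers before reading the queue can deadlock once the results no longer fit in the pipe's buffer. The Python 3 sketch below shows the safe ordering with a hypothetical producer.

```python
import multiprocessing

def produce(results, n):
    """Hypothetical worker: put n (index, square) results on the queue."""
    for i in range(n):
        results.put((i, i * i))

def collect(results, expected):
    """Read exactly `expected` items; call this before joining the producers."""
    return dict(results.get() for _ in range(expected))

def main(n=20000):
    results = multiprocessing.Queue()
    p = multiprocessing.Process(target=produce, args=(results, n))
    p.start()
    squares = collect(results, n)  # drain the queue first ...
    p.join()                       # ... then join; the reverse order can deadlock
    return squares

if __name__ == "__main__":
    print(len(main()))
```

A small result set fits in the pipe buffer, which would be consistent with the 9x9 test passing while larger matrices hang.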
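When I tested the code on a larger 1000x1000 matrix, I noticed that the calculation code finished well ahead of the Queue and matrix assignment code. Using cProfile, I found that one bottleneck was the default polling interval processCheckTime=1.0 (line 23), and lowering this value improves the speed of the results (see the timing examples at the bottom of this post). This may be useful information for others who are new to multiprocessing in Python.
t = 1.0: execution time 18 s
t = 0.01: execution time 3 s
t = 1.0: execution time 86 s
t = 0.01: execution time 23 s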
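One way to avoid tuning a sleep interval at all, sketched here in Python 3 as an alternative rather than as the posted method, is to block on `get()` with a timeout: the collector wakes as soon as a result arrives and only waits when the queue is genuinely empty. The names `drain` and `produce` are hypothetical.

```python
import multiprocessing
import queue  # provides queue.Empty (the module is named Queue in Python 2)

def drain(results, expected, timeout=1.0):
    """Collect `expected` (key, value) items, blocking on get() instead of sleeping."""
    collected = {}
    while len(collected) < expected:
        try:
            key, value = results.get(timeout=timeout)
        except queue.Empty:
            continue  # nothing arrived in time; a worker liveness check could go here
        collected[key] = value
    return collected

def produce(results, n):
    """Hypothetical worker: put n (index, value) results on the queue."""
    for i in range(n):
        results.put((i, float(i)))

def main(n=1000):
    results = multiprocessing.Queue()
    p = multiprocessing.Process(target=produce, args=(results, n))
    p.start()
    out = drain(results, n)
    p.join()
    return len(out)

if __name__ == "__main__":
    print(main())
```

As written, `drain` loops forever if a worker dies before delivering all of its results, so a real version would want the liveness check noted in the comment.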
112  def unitTest():
113      import cStringIO, os
114      from fingerprintReader import MismatchKernelReader
115      from fingerprintScorers import FeatureVectorLinearKernel
116      exampleData = cStringIO.StringIO()  # 9 examples from GPCR (3,1)-mismatch descriptors, first 10 columns.
117      exampleData.write( ",AAA,AAC,AAD,AAE,AAF,AAG,AAH,AAI,AAK" + os.linesep )
118      exampleData.write( "TS1R2_HUMAN,5,2,3,6,8,6,6,7,4" + os.linesep )
119      exampleData.write( "SSR1_HUMAN,11,6,5,7,4,7,4,7,9" + os.linesep )
120      exampleData.write( "OXYR_HUMAN,27,13,14,14,15,14,11,16,14" + os.linesep )
121      exampleData.write( "ADA1A_HUMAN,7,3,5,4,5,7,3,8,4" + os.linesep )
122      exampleData.write( "TA2R_HUMAN,16,6,7,8,9,10,6,6,6" + os.linesep )
123      exampleData.write( "OXER1_HUMAN,10,6,5,7,11,9,5,10,6" + os.linesep )
124      exampleData.write( "NPY1R_HUMAN,3,3,0,2,3,1,0,6,2" + os.linesep )
125      exampleData.write( "NPSR1_HUMAN,0,1,1,0,3,0,0,6,2" + os.linesep )
126      exampleData.write( "HRH3_HUMAN,16,9,9,13,14,14,9,11,9" + os.linesep )
127      exampleData.write( "HCAR2_HUMAN,3,1,3,2,5,1,1,6,2" )
130      m = createSimilarityMatrix( exampleData, MismatchKernelReader, FeatureVectorLinearKernel, columnIDs,
131                                  verbose=True, )
132      m.SetOutputPrecision( 6 )
133      print m
135  ## end of unitTest()
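Answer 4 (score: 0)
Following up on the last comment attached to the code I posted on March 21: I found that multiprocessing.Pool + SQLite (pysqlite2) was unusable for my particular task, because two problems occurred:
(1) With the default connection, every worker process other than the first executed its insert query only once.
(2) When I changed the connection keyword to check_same_thread=False, the full pool of workers was used, but only some of the queries succeeded and others failed. When each worker also executed time.sleep(0.01), the number of query failures was reduced, but not eliminated.
(3) Less importantly, I could hear my hard disk reading and writing frantically, even for a small job list of 10 insert queries.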
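SQLite allows only one writer at a time on a database file, so one pattern that sidesteps concurrent inserts is to let the workers compute only and have a single process do all the writing. The Python 3 sketch below uses the standard-library `sqlite3` module instead of pysqlite2; the `compute` function, the table name, and the in-memory database are assumptions for illustration, and this is not the approach tried above.

```python
import sqlite3
from multiprocessing import Pool

def compute(value):
    """Hypothetical CPU-bound work; returns one row to store."""
    return ("row-%d" % value, value * 0.5)

def store(rows, path=":memory:"):
    """Insert all rows from a single connection in one transaction; return the row count."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (oneText TEXT, oneValue REAL)")
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO results VALUES (?, ?)", rows)
    count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return count

def main():
    with Pool(4) as pool:
        rows = pool.map(compute, range(10))  # workers never touch the database
    return store(rows)

if __name__ == "__main__":
    print(main())
```

Batching the inserts into one transaction should also reduce disk activity compared with committing each insert separately.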
from multiprocessing import Pool, current_process
import MySQLdb
from numpy import random

if __name__ == "__main__":

    numValues = 50000
    tableName = "tempTable"
    useHostName = ""
    useUserName = ""  # Insert your values here.
    usePassword = ""
    useDBName = ""

    # Setup database and table for results.
    dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
    topCursor = dbConnection.cursor()
    # Assuming table does not exist, will be eliminated at the end of the script.
    topCursor.execute( 'CREATE TABLE %s (oneText TEXT, oneValue REAL)' % tableName )
    topCursor.close()
    dbConnection.close()

    # Define simple function for storing results.
    def work( storeValue ):
        #print "%s storing value %f" % ( current_process().name, storeValue )
        try:
            dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
            cursor = dbConnection.cursor()
            cursor.execute( "SET AUTOCOMMIT=1" )
            try:
                query = "INSERT INTO %s VALUES ('%s',%f)" % ( tableName, current_process().name, storeValue )
                #print query
                cursor.execute( query )
            except:
                print "Query failed."

            cursor.close()
            dbConnection.close()
        except:
            print "Connection/cursor problem."

    # Create set of values to assign
    values = random.random( numValues )

    # Create pool of workers
    pool = Pool( processes=6 )
    # Execute assignments.
    for value in values: pool.apply_async( func=work, args=(value,) )
    pool.close()
    pool.join()

    # Cleanup temporary table.
    dbConnection = MySQLdb.connect( host=useHostName, user=useUserName, passwd=usePassword, db=useDBName )
    topCursor = dbConnection.cursor()
    topCursor.execute( 'DROP TABLE %s' % tableName )
    topCursor.close()
    dbConnection.close()