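I'm trying to run some calculations on a large amount of data. The calculation itself is a simple correlation, but the dataset is large enough that I stared at my computer for more than 10 minutes without getting any output.
I then tried using multiprocessing.Pool. This is my current code: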
import collections
from multiprocessing import Pool
import numpy
from haversine import haversine
def calculateCorrelation(data_1, data_2, dist):
    """
    Fill the correlation matrix between data_1 and data_2
    :param data_1: dictionary {key : [coordinates]}
    :param data_2: dictionary {key : [coordinates]}
    :param dist: maximum distance between coordinates to be considered, in kilometers.
    :return: numpy array containing the correlation between each complaint category.
    """
    pool = Pool(processes=20)
    data_1 = collections.OrderedDict(sorted(data_1.items()))
    data_2 = collections.OrderedDict(sorted(data_2.items()))
    data_1_size = len(data_1)
    data_2_size = len(data_2)
    corr = numpy.zeros((data_1_size, data_2_size))
    for index_1, key_1 in enumerate(data_1):
        for index_2, key_2 in enumerate(data_2):  # Forming pairs
            type_1 = data_1[key_1]  # List of data in data_1 of type *i*
            type_2 = data_2[key_2]  # List of data in data_2 of type *j*
            result = pool.apply_async(correlation, args=[type_1, type_2, dist])
            corr[index_1, index_2] = result.get()
    pool.close()
    pool.join()
    return corr
def correlation(type_1, type_2, dist):
    in_range = 0
    for l1 in type_1:  # Coordinates of one item in data_1
        for l2 in type_2:  # Coordinates of one item in data_2
            p1 = (float(l1[0]), float(l1[1]))
            p2 = (float(l2[0]), float(l2[1]))
            if haversine(p1, p2) <= dist:  # Distance between two items of types *i* and *j*
                in_range += 1  # Number of items in data_2 inside the area of an item in data_1
    total = float(len(type_1) * len(type_2))
    if total != 0:
        return in_range / total  # Correlation between category *i* and *j*
corr = calculateCorrelation(permiters_per_region, complaints_per_region, 20)
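However, the speed has not improved. No parallel processing seems to be taking place: a single thread does almost all of the work. At some points, all of the Python workers sit at 0.0% CPU while one thread uses 100%.
Am I missing something?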
Answer 0 (score: 3)
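In the loop that generates the jobs, you call apply_async and then immediately wait for it to finish, which effectively serializes the work. You could instead add the result objects to a queue and wait on them only after all the work has been dispatched (see below), or even switch to the map method.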
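To make the serialization visible, here is a small timing sketch that is separate from the fix to the original function. It is only an illustration under some assumptions: `slow_square` is a hypothetical stand-in for an expensive task, and it uses `multiprocessing.pool.ThreadPool` (which exposes the same `apply_async` interface as `Pool`) so that it runs anywhere without pickling concerns.

```python
import time
from multiprocessing.pool import ThreadPool  # same apply_async API as multiprocessing.Pool


def slow_square(x):
    """Hypothetical stand-in for an expensive task."""
    time.sleep(0.2)
    return x * x


def run_blocking(pool, items):
    """Call get() right after each apply_async: only one task is in flight at a time."""
    start = time.perf_counter()
    results = [pool.apply_async(slow_square, args=(x,)).get() for x in items]
    return results, time.perf_counter() - start


def run_deferred(pool, items):
    """Submit every task first, then collect: the tasks can overlap."""
    start = time.perf_counter()
    pending = [pool.apply_async(slow_square, args=(x,)) for x in items]
    results = [p.get() for p in pending]
    return results, time.perf_counter() - start


with ThreadPool(processes=4) as pool:
    blocking_results, blocking_time = run_blocking(pool, range(4))
    deferred_results, deferred_time = run_deferred(pool, range(4))

print(f"blocking: {blocking_time:.2f}s, deferred: {deferred_time:.2f}s")
```

Both variants return the same values, but the blocking one should take roughly the sum of the task times, while the deferred one should take closer to the time of a single task.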
def calculateCorrelation(data_1, data_2, dist):
    """
    Fill the correlation matrix between data_1 and data_2
    :param data_1: dictionary {key : [coordinates]}
    :param data_2: dictionary {key : [coordinates]}
    :param dist: maximum distance between coordinates to be considered, in kilometers.
    :return: numpy array containing the correlation between each complaint category.
    """
    pool = Pool(processes=20)
    results = []
    data_1 = collections.OrderedDict(sorted(data_1.items()))
    data_2 = collections.OrderedDict(sorted(data_2.items()))
    data_1_size = len(data_1)
    data_2_size = len(data_2)
    corr = numpy.zeros((data_1_size, data_2_size))
    for index_1, key_1 in enumerate(data_1):
        for index_2, key_2 in enumerate(data_2):  # Forming pairs
            type_1 = data_1[key_1]  # List of data in data_1 of type *i*
            type_2 = data_2[key_2]  # List of data in data_2 of type *j*
            # Dispatch the job without waiting for it to finish
            result = pool.apply_async(correlation, args=[type_1, type_2, dist])
            results.append((result, index_1, index_2))
    # All jobs are queued; now collect the results
    for result, index_1, index_2 in results:
        corr[index_1, index_2] = result.get()
    pool.close()
    pool.join()
    return corr
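For the map-style alternative, a rough sketch could look like the following. It is not a drop-in replacement and makes several assumptions: `haversine_km` is a standard-library stand-in for the `haversine` package, the result is a plain list of rows rather than a numpy array, empty categories are given a correlation of 0.0, and the hypothetical `calculate_correlation_map` receives the pool as an argument instead of creating it.

```python
import math


def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def correlation(type_1, type_2, dist):
    """Fraction of (l1, l2) pairs that lie within dist km of each other."""
    total = len(type_1) * len(type_2)
    if total == 0:
        return 0.0  # assumption: empty categories count as zero correlation
    in_range = sum(
        1
        for l1 in type_1
        for l2 in type_2
        if haversine_km((float(l1[0]), float(l1[1])),
                        (float(l2[0]), float(l2[1]))) <= dist
    )
    return in_range / total


def calculate_correlation_map(data_1, data_2, dist, pool):
    """Build every (type_1, type_2, dist) task, then let starmap dispatch them all."""
    keys_1, keys_2 = sorted(data_1), sorted(data_2)
    tasks = [(data_1[k1], data_2[k2], dist) for k1 in keys_1 for k2 in keys_2]
    flat = pool.starmap(correlation, tasks)  # results come back in task order
    n = len(keys_2)
    # Reshape the flat result list into rows (data_1 keys) by columns (data_2 keys).
    return [flat[i * n:(i + 1) * n] for i in range(len(keys_1))]
```

With a real process pool this would be called as `with Pool(processes=4) as pool: corr = calculate_correlation_map(data_1, data_2, 20, pool)`, placed inside an `if __name__ == "__main__":` guard. Because `starmap` receives all the tasks at once, no manual bookkeeping of result objects and indices is needed.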