I am running a program that uses multiprocessing to process a dataframe of 3 lakh (300,000) rows. I do this on a 64-core VM with 62 processes created via `multiprocessing.Process` in Python. Each process is fed 4900 rows.
Strangely, the processes take very different amounts of time to finish. The first process completed its task in 15 minutes, while the last one took more than 70 minutes. Below is the code block I use for multiprocessing.
import multiprocessing

# define dataframe `data` here
data_thread = data
uid = "final"  ### make sure to change uid
batch_size = 4900
counter = 0
datalen = len(data_thread)
Flag = True
processes = []
indices = []  # was missing: used below to record each batch's (start, end)
while Flag:
    start = counter * batch_size
    end = min(datalen, start + batch_size)
    if end >= datalen:
        Flag = False
    indices.append((start, end))
    data_split = data_thread.iloc[start:end]
    threadName = "process_" + str(counter)
    processes.append(multiprocessing.Process(target=process, args=(data_split, uid, threadName, start, end)))
    counter = counter + 1
startCount = 0
while startCount < len(processes):
    t = processes[startCount]
    try:
        t.start()
    except Exception:  # bare except hides real errors; %lf was the wrong format for an int
        print("Error encountered while starting process_%d: %s" % (startCount, str(indices[startCount])))
    print("Started: process_" + str(startCount))
    startCount = startCount + 1
endCount = 0
while endCount < len(processes):
    t = processes[endCount]
    t.join()
    print("Joined: process_" + str(endCount))
    endCount = endCount + 1
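For reference, one common way to even out run times is to let a `multiprocessing.Pool` hand batches to whichever worker is idle, instead of pinning one fixed batch to each process. This is a minimal sketch, not the code above: `process_chunk` is a hypothetical stand-in for the real `process` function, and plain lists stand in for the dataframe slices.

```python
import multiprocessing

def process_chunk(chunk):
    # hypothetical stand-in for the real `process` function: just sums the rows
    return sum(chunk)

def run(data, batch_size, workers):
    # split the data into batches, then let the pool schedule them dynamically:
    # a worker that finishes early picks up the next unprocessed batch
    chunks = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    with multiprocessing.Pool(workers) as pool:
        return pool.map(process_chunk, chunks)

if __name__ == "__main__":
    print(run(list(range(10)), batch_size=3, workers=2))
```

With smaller batches (more batches than workers), slow batches no longer leave most cores idle at the end of the run.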