My workflow is slow and could probably use some optimization, because it takes a long time to run. I have a dictionary named `databaseHash` whose keys are timepoints in seconds since the epoch (ints) and whose values are DataFrames of N columns by 1 row (the feature data for that timepoint). I am trying to put together a single DataFrame for time-series analysis: I pull several timepoints out of the dictionary, concatenate them column-wise, rename the columns so they are unique and sequential, and then concatenate those single-row frames row-wise to form the final DataFrame.
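For concreteness, here is a minimal mock of that structure (made-up timestamps and feature names; the real DataFrames have many more columns):

import pandas as pd

#Minimal mock of databaseHash: epoch-seconds key -> N-column x 1-row DataFrame
databaseHash = {
    t: pd.DataFrame({'featureA': [0.1], 'featureB': [0.2]})
    for t in range(1600000000, 1600000000 + 600 * 60, 60)
}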
def getExampleRow(inputs):
    #inputs = (startTimepoint, endTimepoint, targetTimepoints, secondsSpacing, bufferLabels)
    global databaseHash
    #KEY: (int) timepoint in seconds since epoch
    #VALUE: (DataFrame) N columns x 1 row
    concatThese = [] #To hold all the DataFrames for a single row
    count = 0 #To create unique, sequential names for each set of columns from each timepoint
    #Go through the range of desired timepoints
    for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
        #For each timepoint, append the DataFrame with a suffix added to each column name
        concatThese.append(databaseHash[currentTimepoint].add_suffix("_" + str(count)))
        count += 1
    #Target timepoints are the points in the future that the previously appended columns are intended to predict
    targetCount = 0
    #For each target timepoint (predicting multiple future timepoints)
    for targetTimepoint in inputs[2]:
        #Same concatenation as above, but with a '_target_' suffix on the column names
        concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_' + str(inputs[4][targetCount])))
        targetCount += 1
    return pd.concat(concatThese, axis=1) #Concat and return the single-row DataFrame
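
#For reference, each inputs tuple is shaped like this (made-up numbers):
#  (startingTrainingTimepoint, endingTrainingTimepoint, targetTimepoints, secondsSpacing, numBufferPoints)
#e.g. row = getExampleRow((1600000000, 1600032400, [1600032460, 1600032520], 60, [1, 2]))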
from joblib import Parallel, delayed
import gc

parallelInputs = []
#Generate all the appropriate timepoints; shown for reference, probably not important to optimize
while max(targetTimepoints) < largestTimepoint:
    #Create the list of inputs for multiprocessing
    parallelInputs.append((startingTrainingTimepoint, endingTrainingTimepoint, targetTimepoints, secondsSpacing, numBufferPoints))
    ####UPDATE THE VARIABLES####
    offset += secondsSpacing
    startingTrainingTimepoint = earliestTimepoint + offset
    endingTrainingTimepoint = startingTrainingTimepoint + numTrainingPoints*secondsSpacing
    targetTimepoints = [endingTrainingTimepoint + x*secondsSpacing for x in numBufferPoints]
    ############################

df = None
#Run in parallel and calculate all example rows
results = Parallel(n_jobs=50, verbose=7)(delayed(getExampleRow)(i) for i in parallelInputs)
df = pd.concat(results, axis=0)
results = None
gc.collect()
I ran a profiler on the getExampleRow() function, with the following result:
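(For reference, the profiles below come from line_profiler; I collected them with the IPython magics, roughly like this:)

%load_ext line_profiler
%lprun -f getExampleRow getExampleRow(parallelInputs[0])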
Timer unit: 1e-06 s

Total time: 0.994672 s
File: <ipython-input-5-631cda7694d5>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def getExampleRow(inputs):
     2                                               #only input index 2 is an array, the target timepoints
     3                                               global databaseHash
     4         1          4.0      4.0      0.0      concatThese = []
     5         1          2.0      2.0      0.0      count = 0
     6       541        820.0      1.5      0.1      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
     7       540     824867.0   1527.5     82.9          concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
     8       540       1525.0      2.8      0.2          count += 1
     9
    10                                               #Add all of the target timepoints at the end
    11         1          2.0      2.0      0.0      targetCount = 0
    12        16         29.0      1.8      0.0      for targetTimepoint in inputs[2]:
    13        15      22245.0   1483.0      2.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
    14        15         40.0      2.7      0.0          targetCount += 1
    15
    16         1     145138.0 145138.0     14.6      return pd.concat(concatThese, axis=1)
When I get rid of the add_suffix() call in the first loop, total time drops from ~0.99 s to ~0.16 s and its share of the total time goes away:
Timer unit: 1e-06 s

Total time: 0.160778 s
File: <ipython-input-11-573f87244998>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def getExampleRow(inputs):
     2                                               #only input index 2 is an array, the target timepoints
     3                                               global databaseHash
     4         1          3.0      3.0      0.0      concatThese = []
     5         1          2.0      2.0      0.0      count = 0
     6       541        520.0      1.0      0.3      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
     7       540       1005.0      1.9      0.6          concatThese.append(databaseHash[currentTimepoint])
     8       540        522.0      1.0      0.3          count += 1
     9
    10                                               #Add all of the target timepoints at the end
    11         1          1.0      1.0      0.0      targetCount = 0
    12        16         24.0      1.5      0.0      for targetTimepoint in inputs[2]:
    13        15      16415.0   1094.3     10.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
    14        15         36.0      2.4      0.0          targetCount += 1
    15
    16         1     142250.0 142250.0     88.5      return pd.concat(concatThese, axis=1)
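One idea I have considered but have not benchmarked is to skip the per-frame add_suffix entirely and build all the names in a single pass after the concat, using the keys argument of pd.concat. This is an untested sketch (same databaseHash and pd as above; getExampleRowNoSuffix is just a made-up name):

def getExampleRowNoSuffix(inputs):
    #Untested idea: concat the raw frames once, then flatten the MultiIndex
    timepoints = list(range(inputs[0], inputs[1], inputs[3]))
    frames = [databaseHash[t] for t in timepoints]
    labels = [str(i) for i in range(len(timepoints))]
    frames += [databaseHash[t] for t in inputs[2]]
    labels += ['target_' + str(b) for b in inputs[4]]
    #keys= adds each label as an outer column level in a single concat
    row = pd.concat(frames, axis=1, keys=labels)
    #Flatten the (label, originalName) MultiIndex back to originalName_label
    #(assumes the original column names are strings)
    row.columns = [col + '_' + key for key, col in row.columns]
    return row

I do not know whether this actually avoids the copying that seems to make add_suffix expensive, which brings me to the question: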
Is there a faster way to make the column names unique for each timepoint's set of columns?