Fastest way to make column names unique when building a DataFrame for time series analysis?

Asked: 2018-09-13 19:55:36

Tags: python pandas dataframe memory optimization

My workflow is slow and could probably use some optimization, because it takes a very long time to run. I have a dictionary called databaseHash whose KEYs are timepoints in seconds since the epoch and whose VALUEs are DataFrames with one row and N columns (the feature data for that timepoint). I am trying to assemble a single DataFrame for time series analysis: I pull several timepoints out of the dictionary, concatenate them column-wise, rename the columns so they are unique and sequential, and then concatenate those single-row results row-wise to form the final DataFrame.
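For context, here is a minimal sketch of the assumed structure of databaseHash (the timepoints, column names, and values are made up purely for illustration):

import pandas as pd

# Hypothetical illustration of the databaseHash layout described above:
# each key is a timepoint in seconds since the epoch, each value is a
# 1-row DataFrame holding the feature data for that timepoint.
databaseHash = {
    1536868800: pd.DataFrame([[0.1, 0.2, 0.3]], columns=["featA", "featB", "featC"]),
    1536868860: pd.DataFrame([[0.4, 0.5, 0.6]], columns=["featA", "featB", "featC"]),
}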

def getExampleRow(inputs):

    global databaseHash
    #KEY: (int) timepoint in seconds since epoch
    #VALUE: (dataFrame) N Columns x 1 Row

    concatThese = [] #To hold all the dataFrames for a single row

    count = 0 #To create unique, sequential names for each set of columns from each timepoint.

    #Go through the range of desired timepoints
    for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
        concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
        #For each timepoint, append the dataframe and add a suffix to each column name.
        count += 1

    #Target timepoints are the data points in the future that the previously appended rows are intended to predict.
    targetCount = 0
    #For each target timepoint (predicting multiple future timepoints)
    for targetTimepoint in inputs[2]:
        #Do the same concatenating as the previous loop
        concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
        #Change the column names for the target rows
        targetCount += 1

    return pd.concat(concatThese, axis=1) #concat and return the single-row dataframe
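For reference, a hypothetical call to getExampleRow showing how I lay out the inputs tuple (the concrete numbers here are made up):

# Assumed layout of the inputs tuple used above:
#   inputs[0] -> first training timepoint (seconds since epoch)
#   inputs[1] -> end of the training range (exclusive)
#   inputs[2] -> list of target timepoints to predict
#   inputs[3] -> spacing between timepoints, in seconds
#   inputs[4] -> labels used to suffix the target columns
exampleInputs = (1536868800, 1536901200, [1536904800, 1536908400], 60, [1, 2])
singleRow = getExampleRow(exampleInputs)  # 1-row DataFrame, columns suffixed per timepoint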



from joblib import Parallel, delayed
import gc

parallelInputs = []
#Generate all the appropriate time points, here for reference, probably not important to optimize.
while max(targetTimepoints) < largestTimepoint:

    parallelInputs.append((startingTrainingTimepoint, endingTrainingTimepoint, targetTimepoints, secondsSpacing, numBufferPoints))
    #Create the list of inputs for multiprocessing

    ####UPDATE THE VARIABLES####
    offset += secondsSpacing
    startingTrainingTimepoint = earliestTimepoint + offset
    endingTrainingTimepoint = startingTrainingTimepoint + numTrainingPoints*secondsSpacing
    targetTimepoints = [endingTrainingTimepoint + x*secondsSpacing for x in numBufferPoints]
    ############################

df = None
#Run in parallel and calculate all example rows
results = Parallel(n_jobs=50, verbose=7)(delayed(getExampleRow)(i) for i in parallelInputs)
df = pd.concat(results, axis=0)
results = None
gc.collect()

I ran a line profiler on the getExampleRow() function and got these results:

Timer unit: 1e-06 s

Total time: 0.994672 s
File: <ipython-input-5-631cda7694d5>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          4.0      4.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        820.0      1.5      0.1      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540     824867.0   1527.5     82.9          concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
 8       540       1525.0      2.8      0.2          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          2.0      2.0      0.0      targetCount = 0
12        16         29.0      1.8      0.0      for targetTimepoint in inputs[2]: 
13        15      22245.0   1483.0      2.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         40.0      2.7      0.0          targetCount += 1
15                                                                      
16         1     145138.0 145138.0     14.6      return pd.concat(concatThese, axis=1)

When I drop the add_suffix() calls, that line's share of the total time goes down:

Timer unit: 1e-06 s
Total time: 0.160778 s
File: <ipython-input-11-573f87244998>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          3.0      3.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        520.0      1.0      0.3      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540       1005.0      1.9      0.6          concatThese.append(databaseHash[currentTimepoint])
 8       540        522.0      1.0      0.3          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          1.0      1.0      0.0      targetCount = 0
12        16         24.0      1.5      0.0      for targetTimepoint in inputs[2]: 
13        15      16415.0   1094.3     10.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         36.0      2.4      0.0          targetCount += 1
15                                                                      
16         1     142250.0 142250.0     88.5      return pd.concat(concatThese, axis=1)
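Since add_suffix() appears to copy every frame just to relabel it, one idea I have not benchmarked yet (a minimal sketch; the name getExampleRowFast is just a placeholder, and it assumes pd.concat along axis=1 keeps each frame's columns in their original order) is to concatenate the raw frames first and assign all of the unique names in a single pass:

def getExampleRowFast(inputs):
    global databaseHash
    trainingTimepoints = list(range(inputs[0], inputs[1], inputs[3]))

    # Collect the raw frames without renaming them individually.
    frames = [databaseHash[t] for t in trainingTimepoints]
    frames += [databaseHash[t] for t in inputs[2]]

    row = pd.concat(frames, axis=1)  # duplicate column names are allowed at this point

    # Build every column name once, then assign them all in one pass
    # instead of copying each frame with add_suffix().
    names = []
    for count, t in enumerate(trainingTimepoints):
        names.extend(str(col) + "_" + str(count) for col in databaseHash[t].columns)
    for label, t in zip(inputs[4], inputs[2]):
        names.extend(str(col) + "_target_" + str(label) for col in databaseHash[t].columns)
    row.columns = names
    return row

I have not verified whether this is actually faster on the real data, which brings me back to the question: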

Is there a fast way to make my column names unique for each timepoint's set of columns?

0 Answers:

No answers yet.