Fastest way to make column names unique when building a DataFrame for time series analysis?

Asked: 2018-09-13 19:55:36

Tags: python pandas dataframe memory optimization

My workflow is slow and could probably use some optimization, because it takes a very long time to run. I have a dictionary called databaseHash whose KEYs are timepoints in seconds since the epoch and whose VALUEs are DataFrames with one row and N columns (the feature data for that timepoint). I am trying to assemble a single DataFrame for time series analysis: I pull several timepoints out of the dictionary, concatenate them column-wise, rename the columns so they are unique and sequential, and then concatenate those single-row results row-wise to form the final DataFrame.
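For context, here is a minimal sketch of the assumed structure of databaseHash (the timepoints, column names, and values are made up purely for illustration):

import pandas as pd

# Hypothetical illustration of the databaseHash layout described above:
# each key is a timepoint in seconds since the epoch, each value is a
# 1-row DataFrame holding the feature data for that timepoint.
databaseHash = {
    1536868800: pd.DataFrame([[0.1, 0.2, 0.3]], columns=["featA", "featB", "featC"]),
    1536868860: pd.DataFrame([[0.4, 0.5, 0.6]], columns=["featA", "featB", "featC"]),
}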

def getExampleRow(inputs):

    global databaseHash
    #KEY: (int) timepoint in seconds since epoch
    #VALUE: (dataFrame) N Columns x 1 Row

    concatThese = [] #To hold all the dataFrames for a single row

    count = 0 #To create unique, sequential names for each set of columns from each timepoint.

    #Go through the range of desired timepoints
    for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
        concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
        #For each timepoint, append the dataframe and add a suffix to each column name.
        count += 1

    #Target timepoints are the data points in the future that the previously appended rows are intended to predict.
    targetCount = 0
    #For each target timepoint (predicting multiple future timepoints)
    for targetTimepoint in inputs[2]:
        #Do the same concatenating as the previous loop
        concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
        #Change the column names for the target rows
        targetCount += 1

    return pd.concat(concatThese, axis=1) #concat and return the single-row dataframe
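For reference, a hypothetical call to getExampleRow showing how I lay out the inputs tuple (the concrete numbers here are made up):

# Assumed layout of the inputs tuple used above:
#   inputs[0] -> first training timepoint (seconds since epoch)
#   inputs[1] -> end of the training range (exclusive)
#   inputs[2] -> list of target timepoints to predict
#   inputs[3] -> spacing between timepoints, in seconds
#   inputs[4] -> labels used to suffix the target columns
exampleInputs = (1536868800, 1536901200, [1536904800, 1536908400], 60, [1, 2])
singleRow = getExampleRow(exampleInputs)  # 1-row DataFrame, columns suffixed per timepoint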



from joblib import Parallel, delayed
import gc

parallelInputs = []
#Generate all the appropriate time points, here for reference, probably not important to optimize.
while max(targetTimepoints) < largestTimepoint:

    parallelInputs.append((startingTrainingTimepoint, endingTrainingTimepoint, targetTimepoints, secondsSpacing, numBufferPoints))
    #Create the list of inputs for multiprocessing

    ####UPDATE THE VARIABLES####
    offset += secondsSpacing
    startingTrainingTimepoint = earliestTimepoint + offset
    endingTrainingTimepoint = startingTrainingTimepoint + numTrainingPoints*secondsSpacing
    targetTimepoints = [endingTrainingTimepoint + x*secondsSpacing for x in numBufferPoints]
    ############################

df = None
#Run in parallel and calculate all example rows
results = Parallel(n_jobs=50, verbose=7)(delayed(getExampleRow)(i) for i in parallelInputs)
df = pd.concat(results, axis=0)
results = None
gc.collect()

I ran a line profiler on the getExampleRow() function and got these results:

Timer unit: 1e-06 s

Total time: 0.994672 s
File: <ipython-input-5-631cda7694d5>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          4.0      4.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        820.0      1.5      0.1      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540     824867.0   1527.5     82.9          concatThese.append(databaseHash[currentTimepoint].add_suffix("_"+str(count)))
 8       540       1525.0      2.8      0.2          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          2.0      2.0      0.0      targetCount = 0
12        16         29.0      1.8      0.0      for targetTimepoint in inputs[2]: 
13        15      22245.0   1483.0      2.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         40.0      2.7      0.0          targetCount += 1
15                                                                      
16         1     145138.0 145138.0     14.6      return pd.concat(concatThese, axis=1)

When I drop the add_suffix() calls, that line's share of the total time goes down:

Timer unit: 1e-06 s
Total time: 0.160778 s
File: <ipython-input-11-573f87244998>
Function: getExampleRow at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def getExampleRow(inputs):
 2                                               #onle input index 2 is an array, the taret Timepoint
 3                                               global databaseHash 
 4         1          3.0      3.0      0.0      concatThese = []
 5         1          2.0      2.0      0.0      count = 0
 6       541        520.0      1.0      0.3      for currentTimepoint in range(inputs[0], inputs[1], inputs[3]):
 7       540       1005.0      1.9      0.6          concatThese.append(databaseHash[currentTimepoint])
 8       540        522.0      1.0      0.3          count += 1
 9                                                   
10                                               #Add all of the the target timepoints at the end
11         1          1.0      1.0      0.0      targetCount = 0
12        16         24.0      1.5      0.0      for targetTimepoint in inputs[2]: 
13        15      16415.0   1094.3     10.2          concatThese.append(databaseHash[targetTimepoint].add_suffix('_target_'+str(inputs[4][targetCount])))
14        15         36.0      2.4      0.0          targetCount += 1
15                                                                      
16         1     142250.0 142250.0     88.5      return pd.concat(concatThese, axis=1)
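Since add_suffix() appears to copy every frame just to relabel it, one idea I have not benchmarked yet (a minimal sketch; the name getExampleRowFast is just a placeholder, and it assumes pd.concat along axis=1 keeps each frame's columns in their original order) is to concatenate the raw frames first and assign all of the unique names in a single pass:

def getExampleRowFast(inputs):
    global databaseHash
    trainingTimepoints = list(range(inputs[0], inputs[1], inputs[3]))

    # Collect the raw frames without renaming them individually.
    frames = [databaseHash[t] for t in trainingTimepoints]
    frames += [databaseHash[t] for t in inputs[2]]

    row = pd.concat(frames, axis=1)  # duplicate column names are allowed at this point

    # Build every column name once, then assign them all in one pass
    # instead of copying each frame with add_suffix().
    names = []
    for count, t in enumerate(trainingTimepoints):
        names.extend(str(col) + "_" + str(count) for col in databaseHash[t].columns)
    for label, t in zip(inputs[4], inputs[2]):
        names.extend(str(col) + "_target_" + str(label) for col in databaseHash[t].columns)
    row.columns = names
    return row

I have not verified whether this is actually faster on the real data, which brings me back to the question: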

Is there a fast way to make my column names unique for each timepoint's set of columns?

0 Answers:

No answers yet.