Pandas merge is very slow on a large dataset

Asked: 2020-04-24 19:36:06

Tags: python pandas

I have a pipeline that picks up 4000 HL7 files, which I must convert to CSV. Each file contains many HL7 segments, and each OBX segment carries one column (COL1, COL2, ... COL100) with a value and a timestamp. A file may have 50 to 100 OBX segments, and therefore that many columns. I loop over the segments, build a pandas DataFrame for each, and fold it in: if the segment's timestamp already exists in the DataFrame, its column is added to that row; if not, a new row is appended. Finally, I merge the DataFrames of all the files together. This takes a very long time, and I have observed that the final merge (in the function process_hl7msg) is where most of the time is spent.

def parse_segments(segments):
    df_num = pd.DataFrame()
    for each segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        df = pd.DataFrame()
        df = df.append({"Time": obx_timestamp, obs_identifier: observation_value, "device": device}, ignore_index=True)
        if df_num.empty:
            df_num = df
        else:
            df_num = pd.merge(df_num, df, on=["Time", "device"])
    return df_num


def process_hl7msg():
    df = pd.DataFrame()
    df_list = []
    for file_name in file_list:
        segments = get segments
        df_list.append(parse_segments(segments))

    for df1 in df_list:
        if df.empty:
            df = df1
        else:
            df = pd.merge(df, df1, on=["Time", "device"], how='outer')
    return df
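To make the cost concrete, here is a minimal, self-contained sketch (with made-up data; the HL7 parsing is elided) of what a single step of that final loop does. Because each `pd.merge` rebuilds the whole combined frame, doing this once per file means the accumulated frame is reconstructed thousands of times:

```python
import pandas as pd

# Synthetic stand-ins for two parsed files, using the same merge
# keys as the pipeline above ("Time" and "device") but invented
# timestamps and value columns.
df_a = pd.DataFrame({
    "Time": ["t1", "t2"],
    "device": ["device1", "device1"],
    "COL1": [1.0, 4.4],
})
df_b = pd.DataFrame({
    "Time": ["t1", "t3"],
    "device": ["device1", "device2"],
    "COL3": [44.0, 7.0],
})

# The outer merge keeps every (Time, device) pair and fills missing
# value columns with NaN. One such merge per file re-materializes
# the ever-growing combined frame on every iteration.
merged = pd.merge(df_a, df_b, on=["Time", "device"], how="outer")
print(merged)
```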

Below is a sample of each parsed HL7 file, along with the expected output.

File 1  
Time                       EVENT device  COL1  COL2   
20200420232613.6200+0530   start device1 1.0   2.3  
20200420232614.6200+0530         device1 4.4   1.7  

File 2   
Time                      EVENT  device  COL3   COL4  COL5   
20200420232613.6200+0530         device1  44     66    7
20200420232614.6200+0530         device2  1.0    2.3    0.5   
20200420232615.6200+0530  pause  device3  4.4    1.7    0.9

File 3
20200420232613.6200+0530   device2 1.0   2.3  
...
File 4000



**Expected Output:**    
Time                      EVENT device   COL1  COL2  COL3   COL4  COL5   
20200420232613.6200+0530  start  device1  1.0   2.3    44     66    7
20200420232613.6200+0530         device2  1.0   2.3  
20200420232614.6200+0530  end    device1         4.4   1.7  
20200420232615.6200+0530  pause  device2               1.0    2.3    0.5   
20200420232616.6200+0530         device3               4.4    1.7    0.9

Any suggestions to optimize this would be appreciated.

UPDATE1:

obx_timestamp = 20200420232616.6200+0530
obs_identifier = any one or more values from the list (COL1, COL2, ... COL10)
observation_value = any numeric value
device = any one from the list (device1, device2, device3, device4, device5)

UPDATE2:
Added the EVENT column.

t3 = [{'Time': 100, 'device': 'device1', 'EVENT': 'event', 'obx_idx': 'MDC1', 'value': 1.2},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL2', 'value': 4.5},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL4', 'value': 4.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL2', 'value': 2.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL3', 'value': 2.5}]
df = pd.DataFrame.from_records(t3, index=['Time', 'device', 'EVENT', 'obx_idx'])['value'].unstack()

2 Answers:

Answer 0 (score: 0)

Try setting the index on both DataFrames and joining them:

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df.join(df1, how = 'outer')

However, based on the expected output, you could also try a concat with axis=1:

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df_f = pd.concat([df, df1], axis=1)
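Here is a small self-contained illustration (with invented data; variable names mirror the answer) of the index-then-combine idea: once both frames share a ("Time", "device") MultiIndex, aligning them is a single index-based operation rather than a column merge.

```python
import pandas as pd

# Invented stand-ins for two parsed files with a shared key pair.
df = pd.DataFrame({"Time": ["t1", "t2"], "device": ["device1", "device1"],
                   "COL1": [1.0, 4.4]})
df1 = pd.DataFrame({"Time": ["t1", "t3"], "device": ["device1", "device2"],
                    "COL3": [44.0, 7.0]})

# Move the merge keys into a MultiIndex on both frames.
df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)

joined = df.join(df1, how="outer")           # index-aligned outer join
concatenated = pd.concat([df, df1], axis=1)  # equivalent outer alignment
print(joined)
```

Both calls produce the same outer-aligned result here; `pd.concat` generalizes more naturally to a whole list of frames at once.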

Answer 1 (score: 0)

You can change your functions here. The idea is to not create a DataFrame on every loop iteration in parse_segments, but only once at the end: collect the rows as records, create the DataFrame with from_records while specifying the index levels, and then use unstack. That lets you use pd.concat with axis=1 in process_hl7msg. If the data is not too big (not sure with this data source), you could even do everything in one go:

def parse_segments(segments):
    l_seg = []
    for each segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        # append a dictionary to a list
        l_seg.append({'time': obx_timestamp, 'device': device,
                      'obs_idx': obs_identifier, 'value': observation_value})
    # create the dataframe with from_records and specify the index
    return pd.DataFrame.from_records(l_seg, index=['time', 'device', 'obs_idx'])['value']\
                       .unstack()

def process_hl7msg():
    df_list = []
    for file_name in file_list:
        segments = get segments
        df_list.append(parse_segments(segments))
    # use concat
    return pd.concat(df_list, axis=1).reset_index()
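The approach above can be sketched end to end with invented sample records standing in for the parsed OBX segments (the HL7 parsing itself is elided); each record is one reading, and `unstack` pivots the `obs_idx` level into columns:

```python
import pandas as pd

# Invented records for two "files": one reading per dict.
file1 = [{"time": "t1", "device": "device1", "obs_idx": "COL1", "value": 1.0},
         {"time": "t1", "device": "device1", "obs_idx": "COL2", "value": 2.3},
         {"time": "t2", "device": "device1", "obs_idx": "COL1", "value": 4.4}]
file2 = [{"time": "t1", "device": "device1", "obs_idx": "COL3", "value": 44.0},
         {"time": "t3", "device": "device2", "obs_idx": "COL3", "value": 7.0}]

def records_to_frame(records):
    # Build one DataFrame per file, then pivot obs_idx into columns.
    return (pd.DataFrame
            .from_records(records, index=["time", "device", "obs_idx"])
            ["value"].unstack())

# One concat across all files instead of thousands of pairwise merges.
frames = [records_to_frame(f) for f in (file1, file2)]
result = pd.concat(frames, axis=1).reset_index()
print(result)
```

The result has one row per (time, device) pair and one column per observed identifier, matching the shape of the expected output in the question.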