I have a pipeline that ingests 4000k HL7 files which I must convert to CSV. Each file has many HL7 segments, and each OBX segment carries one column (COL1, COL2, ... COL100) together with its value and timestamp. Each file may contain 50 to 100 OBX segments, and therefore 50 to 100 columns. I iterate over each column and build a pandas DataFrame: if a column's timestamp already exists in the DataFrame, the value is added to that existing row; if the timestamp is not yet in the DataFrame, a new row is created. Finally, I merge the DataFrames of all files together. This takes a very long time, and I have observed that the final merge (the function process_hl7msg) is where most of the time goes.
def parse_segments(segments):
    df_num = pd.DataFrame()
    for segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        df = pd.DataFrame()
        df = df.append({"Time": obx_timestamp, obs_identifier: observation_value, "device": device},
                       ignore_index=True)
        if df_num.empty:
            df_num = df
        else:
            df_num = pd.merge(df_num, df, on=["Time", "device"])
    return df_num
def process_hl7msg():
    df = pd.DataFrame()
    df_list = []
    for file_name in file_list:
        segments = get segments from file_name
        df_list.append(parse_segments(segments))
    for df1 in df_list:
        if df.empty:
            df = df1
        else:
            df = pd.merge(df, df1, on=["Time", "device"], how='outer')
    return df
Below is a sample of each parsed HL7 file, along with the expected output.
File 1
Time                      EVENT  device   COL1  COL2
20200420232613.6200+0530  start  device1  1.0   2.3
20200420232614.6200+0530         device1  4.4   1.7

File 2
Time                      EVENT  device   COL3  COL4  COL5
20200420232613.6200+0530         device1  44    66    7
20200420232614.6200+0530         device2  1.0   2.3   0.5
20200420232615.6200+0530  pause  device3  4.4   1.7   0.9

File 3
20200420232613.6200+0530         device2  1.0   2.3
...
File 4000
**Expected Output:**
Time                      EVENT  device   COL1  COL2  COL3  COL4  COL5
20200420232613.6200+0530  start  device1  1.0   2.3   44    66    7
20200420232613.6200+0530         device2  1.0   2.3
20200420232614.6200+0530  end    device1  4.4   1.7
20200420232615.6200+0530  pause  device2              1.0   2.3   0.5
20200420232616.6200+0530         device3              4.4   1.7   0.9
Any suggestions for optimizing this would be greatly appreciated.
UPDATE1:
obx_timestamp = 20200420232616.6200+0530
obs_identifier = any one or more values from the list (COL1, COL2, ... COL10)
observation_value = any numeric value
device = any one from the list (device1, device2, device3, device4, device5)
UPDATE2:
Added the EVENT column.
t3 = [{'Time': 100, 'device': 'device1', 'EVENT': 'event', 'obx_idx': 'MDC1', 'value': 1.2},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL2', 'value': 4.5},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL4', 'value': 4.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL2', 'value': 2.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL3', 'value': 2.5}]
df = pd.DataFrame.from_records(t3, index=['Time', 'device', 'EVENT', 'obx_idx'])['value'].unstack()
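Since some records lack the EVENT key, it comes through as NaN in the index, so those rows will not line up with the row that does carry the event. A minimal sketch of one way around that, assuming each (Time, device) pair carries at most one EVENT:

import pandas as pd

df = pd.DataFrame.from_records(t3)
# records without an 'EVENT' key come through as NaN; propagate the first
# non-null EVENT within each (Time, device) group so the rows line up
df['EVENT'] = df.groupby(['Time', 'device'])['EVENT'].transform('first')
wide = (df.set_index(['Time', 'device', 'EVENT', 'obx_idx'])['value']
          .unstack())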
Answer 0 (score: 0)
Try setting the index on both DataFrames and then joining them:
df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df.join(df1, how = 'outer')
However, based on the expected output, you can also try concat with axis=1:
df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df_f = pd.concat([df, df1], axis=1)
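For example, with two toy frames shaped like File 1 and File 2 above (the shortened timestamps 't1'..'t3' are placeholders, not real HL7 values):

import pandas as pd

df = pd.DataFrame({'Time': ['t1', 't2'], 'device': ['device1', 'device1'],
                   'COL1': [1.0, 4.4], 'COL2': [2.3, 1.7]})
df1 = pd.DataFrame({'Time': ['t1', 't2', 't3'], 'device': ['device1', 'device2', 'device3'],
                    'COL3': [44, 1.0, 4.4], 'COL4': [66, 2.3, 1.7], 'COL5': [7, 0.5, 0.9]})

df.set_index(['Time', 'device'], inplace=True)
df1.set_index(['Time', 'device'], inplace=True)

# concat along axis=1 outer-joins on the shared (Time, device) index by default
out = pd.concat([df, df1], axis=1).reset_index()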
Answer 1 (score: 0)
You can change your functions as follows. The idea is to not create a DataFrame on every iteration of parse_segments, but to build it only once at the end with from_records, specifying the index levels so you can unstack afterwards; process_hl7msg can then combine the per-file frames with pd.concat along axis=1. If the data is not too big (not sure about this data source), you could even do everything in one go (see the sketch after the code below):
def parse_segments(segments):
    l_seg = []
    for segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        # append a dictionary to a list instead of building a DataFrame per segment
        l_seg.append({'time': obx_timestamp, 'device': device,
                      'obs_idx': obs_identifier, 'value': observation_value})
    # create the DataFrame once with from_records, specifying the index levels
    return pd.DataFrame.from_records(l_seg, index=['time', 'device', 'obs_idx'])['value']\
             .unstack()

def process_hl7msg():
    df_list = []
    for file_name in file_list:
        segments = get segments from file_name
        df_list.append(parse_segments(segments))
    # use concat to align all per-file frames on the shared (time, device) index
    return pd.concat(df_list, axis=1).reset_index()
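A minimal sketch of that single-pass variant; read_segments() and parse_obx() below are hypothetical stand-ins for the pseudocode accessors above:

import pandas as pd

def process_hl7msg_single_pass(file_list):
    # collect plain dicts across all files, then pivot once at the end
    records = []
    for file_name in file_list:
        for segment in read_segments(file_name):  # hypothetical reader
            # hypothetical parser returning the four fields per OBX segment
            obx_timestamp, obs_identifier, observation_value, device = parse_obx(segment)
            records.append({'time': obx_timestamp, 'device': device,
                            'obs_idx': obs_identifier, 'value': observation_value})
    # one from_records + unstack for the whole batch instead of 4000 merges
    return (pd.DataFrame.from_records(records, index=['time', 'device', 'obs_idx'])['value']
              .unstack()
              .reset_index())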