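I have four Excel files that I need to merge into a single Excel file: a demographics file containing ID, initials, age, and sex; a lab file containing ID, initials, test name, test date, and test value; a medical history file containing ID, initials, medical condition, and start and stop dates; and a medications file containing ID, initials, drug name, dose, frequency, and start and stop dates.
There are 50 patients. The demographics file contains 50 rows, one per patient. The other files also cover the same 50 patients but have between 100 and 400 rows each, because every patient has multiple lab tests or multiple medications.
When I merge with pandas, I get duplicated rows or entries assigned to the wrong patient. The challenge is to merge so that, when a patient has more medications than lab tests, the lab test columns are left blank instead of being repeated for each medication.
Here is a shortened representation: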
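import pandas as pd
# Current pandas versions use the keyword sheet_name (the older spelling was sheetname)
lab = pd.read_excel('data/data.xlsx', sheet_name='lab')
drugs = pd.read_excel('data/data.xlsx', sheet_name='drugs')
# Merging on ID alone pairs every drug row with every lab row of the same patient
merged_data = pd.merge(drugs, lab, on='ID', how='left')
merged_data.to_excel('merged_data.xls')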
This produces the following result: (image: Pandas merge result)
I would prefer this result: (image: Preferred output)
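To make the duplication concrete, here is a minimal sketch using made-up toy data (the drug and test names below are placeholders, not values from the real files). Because `ID` is repeated on both sides, a merge on `ID` alone pairs every drug row with every lab row of the same patient:

```python
import pandas as pd

# Hypothetical toy data: patient 1 has 3 drugs and 2 lab tests
drugs = pd.DataFrame({'ID': [1, 1, 1], 'Drug Name': ['A', 'B', 'C']})
lab = pd.DataFrame({'ID': [1, 1], 'Name': ['Microscopy', 'RDT']})

# Every drug row matches both lab rows, so 3 x 2 = 6 rows come back
merged = pd.merge(drugs, lab, on='ID', how='left')
print(merged)
```

Each drug appears twice here, once per lab test, which is the repetition seen in the merge result above.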
Answer 0 (score: 1)
Consider using cumcount() on a groupby() of ID to number each patient's rows, then join on that field together with ID:
drugs['GrpCount'] = (drugs.groupby(['ID'])).cumcount()
lab['GrpCount'] = (lab.groupby(['ID'])).cumcount()
merged_data = pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='left').drop(['GrpCount'], axis=1)
#     ID  Initials_x  Drug Name                      Frequency        Route        Start Date   End Date     Initials_y  Name                   Result Date  Result
# 0   1   AB          AMPICLOX                       NaN              Oral         21-Jun-2016  21-Jun-2016  AB          Rapid Diagnostic Test  30-May-16    Abnormal
# 1   1   AB          CIPROFLOXACIN                  Daily            Oral         30-May-2016  03-Jun-2016  AB          Microscopy             30-May-16    Normal
# 2   1   AB          Ibuprofen Tablet 400 mg        Two Times a Day  Oral         06-Oct-2016  10-Oct-2016  NaN         NaN                    NaN          NaN
# 3   1   AB          COARTEM                        NaN              Oral         17-Jun-2016  17-Jun-2016  NaN         NaN                    NaN          NaN
# 4   1   AB          INJECTABLE ARTESUNATE          12 Hourly        Intravenous  01-Jun-2016  02-Jun-2016  NaN         NaN                    NaN          NaN
# 5   1   AB          COTRIMOXAZOLE                  Daily            Oral         30-May-2016  12-Jun-2016  NaN         NaN                    NaN          NaN
# 6   1   AB          METRONIDAZOLE                  Two Times a Day  Oral         30-May-2016  03-Jun-2016  NaN         NaN                    NaN          NaN
# 7   2   SS          GENTAMICIN                     Daily            Intravenous  04-Jun-2016  04-Jun-2016  SS          Microscopy             6-Jun-16     Abnormal
# 8   2   SS          METRONIDAZOLE                  8 Hourly         Intravenous  04-Jun-2016  06-Jun-2016  SS          Complete Blood Count   6-Oct-16     Recorded
# 9   2   SS          Oral Rehydration Salts Powder  PRN              Oral         06-Jun-2016  06-Jun-2016  NaN         NaN                    NaN          NaN
# 10  2   SS          ZINC                           8 Hourly         Oral         06-Jun-2016  06-Jun-2016  NaN         NaN                    NaN          NaN
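The question mentions four files, and a left merge from drugs would drop lab rows for any patient with more lab tests than drugs. One possible way to extend the same idea is sketched below: number the rows in every table and chain outer merges. The helper name `stack_by_patient` and the toy data are assumptions made for illustration, not part of the original answer.

```python
from functools import reduce

import pandas as pd


def stack_by_patient(frames, key='ID'):
    """Place several per-patient tables side by side, row by row."""
    numbered = []
    for df in frames:
        df = df.copy()
        # 0, 1, 2, ... within each patient, in the table's existing row order
        df['GrpCount'] = df.groupby(key).cumcount()
        numbered.append(df)
    # An outer merge keeps rows from whichever table is longest for a patient
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=[key, 'GrpCount'], how='outer'),
        numbered,
    )
    return (merged.sort_values([key, 'GrpCount'])
                  .drop(columns='GrpCount')
                  .reset_index(drop=True))


# Hypothetical toy data with distinct column names in each table
demo = pd.DataFrame({'ID': [1, 2], 'Age': [30, 40]})
drugs = pd.DataFrame({'ID': [1, 1, 1, 2], 'Drug Name': ['A', 'B', 'C', 'D']})
lab = pd.DataFrame({'ID': [1, 2, 2], 'Name': ['Microscopy', 'RDT', 'CBC']})

result = stack_by_patient([demo, drugs, lab])
print(result)
```

In this sketch the demographics values appear only on the first row of each patient and the shorter tables are padded with NaN, which Excel shows as empty cells. Columns shared by several tables, such as the initials, would receive pandas' `_x`/`_y` suffixes, so it may be simpler to keep them in one table only before merging.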