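I have a primary dataframe and a secondary dataframe. Where the combination of ID variables matches, I want to replace the values in the primary dataframe with the values from the secondary dataframe. One of the ID variables has mixed data types in the primary dataframe. I was able to solve the problem, but my solution seems overly complicated, and I'm hoping someone here can help me find a more elegant approach.
Note that rows with ID2 = 'Missing' or 'indicator' = 1 never need to be replaced.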
import numpy as np
import pandas as pd
primary_df = pd.DataFrame(data=
{'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
'ID2': ['0-100', -1.0, -2.0, -3.0, '0-10', -1.0,'300-400', 'Missing', '-4.0'],
'value' : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})
secondary_df = pd.DataFrame(data=
{'ID1': ['XXX111','ZZZ333'],
'ID2': [-3,-4],
'value': [0.04, 0.09]})
desired_df = pd.DataFrame(data=
{'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
'ID2': ['0-100', -1, -2, -3, '0-10', -1,'300-400', 'Missing', -4],
'value' : [0.1, 0.2, 0.3, 0.04, 0.5, 0.6, 0.7, 0.8, 0.09],
'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})
In [6]: primary_df
Out[6]:
      ID1      ID2  value  indicator
0  XXX111    0-100    0.1        1.0
1  XXX111       -1    0.2        NaN
2  XXX111       -2    0.3        NaN
3  XXX111       -3    0.4        NaN
4  YYY222     0-10    0.5        1.0
5  YYY222       -1    0.6        NaN
6  ZZZ333  300-400    0.7        1.0
7  ZZZ333  Missing    0.8        NaN
8  ZZZ333     -4.0    0.9        NaN
In [7]: secondary_df
Out[7]:
      ID1  ID2  value
0  XXX111   -3   0.04
1  ZZZ333   -4   0.09
In [8]: desired_df
Out[8]:
      ID1      ID2  value  indicator
0  XXX111    0-100   0.10        1.0
1  XXX111       -1   0.20        NaN
2  XXX111       -2   0.30        NaN
3  XXX111       -3   0.04        NaN
4  YYY222     0-10   0.50        1.0
5  YYY222       -1   0.60        NaN
6  ZZZ333  300-400   0.70        1.0
7  ZZZ333  Missing   0.80        NaN
8  ZZZ333       -4   0.09        NaN
Here is my very convoluted solution:
pdfIndctr = primary_df[primary_df['indicator'] == 1].copy() # rows with indicator = 1; these never need to be replaced
pdfMissing = primary_df[primary_df['ID2'] == 'Missing'].copy() # rows with ID2 = 'Missing'; these never need to be replaced
pdfRest = primary_df[(primary_df['ID2'] != 'Missing') & primary_df['indicator'].isnull()].copy() # the rest of the rows
pdfRest['ID2'] = pdfRest['ID2'].apply(lambda x: int(float(x))) # cast ID2 to int so it can be merged with secondary_df
pdfRest_fixed = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='inner', suffixes=('drop','')) # merge to pick up the rows to be replaced
pdfRest_same = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='left', suffixes=('','drop'), indicator=True) # merge again to identify rows not to be replaced
pdfRest_same = pdfRest_same[pdfRest_same['_merge'] == 'left_only'].copy() # keep only the rows that were not found in secondary_df
desired_df = pd.concat([pdfIndctr, pdfMissing, pdfRest_fixed, pdfRest_same], sort=True) # put everything back together
desired_df.drop(columns=['_merge','valuedrop'], inplace=True) # drop the helper columns
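One direction that might be simpler, offered only as a rough sketch: build a numeric merge key for just the rows that are allowed to change, do a single left merge, and fall back to the original value where nothing matched. The names `eligible`, `_key`, `_new` and `result` are made up for illustration, and the sketch assumes each (ID1, ID2) pair appears at most once in secondary_df, so the left merge keeps one output row per primary row.

```python
import numpy as np
import pandas as pd

primary_df = pd.DataFrame({
    'ID1': ['XXX111', 'XXX111', 'XXX111', 'XXX111', 'YYY222', 'YYY222', 'ZZZ333', 'ZZZ333', 'ZZZ333'],
    'ID2': ['0-100', -1.0, -2.0, -3.0, '0-10', -1.0, '300-400', 'Missing', '-4.0'],
    'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'indicator': [1, np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})
secondary_df = pd.DataFrame({
    'ID1': ['XXX111', 'ZZZ333'],
    'ID2': [-3, -4],
    'value': [0.04, 0.09]})

# Rows that may be replaced: no indicator set and ID2 is not 'Missing'.
eligible = primary_df['indicator'].isna() & primary_df['ID2'].ne('Missing')

# Numeric key for eligible rows only; every other row gets NaN and so
# cannot match anything in secondary_df (which has no NaN keys).
key = pd.to_numeric(primary_df['ID2'].where(eligible), errors='coerce')

# Hypothetical helper columns: '_key' is the merge key, '_new' the replacement value.
left = primary_df.assign(_key=key)
right = (secondary_df
         .rename(columns={'ID2': '_key', 'value': '_new'})
         .astype({'_key': float}))

# Assumes (ID1, ID2) is unique in secondary_df, so row count and order are preserved.
merged = left.merge(right, on=['ID1', '_key'], how='left')

result = primary_df.copy()
# Take the secondary value where a match was found, otherwise keep the original.
result['value'] = merged['_new'].fillna(merged['value']).to_numpy()
# Normalise ID2 to int on the rows that had a numeric key; leave the rest untouched.
result['ID2'] = [int(k) if pd.notna(k) else v
                 for v, k in zip(primary_df['ID2'], key)]

print(result)
```

Compared with splitting the frame into three pieces and re-appending them, this keeps the original row order and index, and the rows with indicator = 1 or ID2 = 'Missing' are never touched because their key is NaN.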