I have two datasets that I want to merge into a single pandas DataFrame. They look like this:
import numpy
import pandas

# df1: protein A has measurement2/3 only, protein B has measurement1 only.
df1 = pandas.DataFrame({
    'protein': ['A']*4 + ['B']*4,
    'repeat': range(1, 9),
    'measurement1': [numpy.nan]*4 + list(numpy.random.uniform(0, 1, 4)),
    'measurement2': list(numpy.random.uniform(0, 1, 4)) + [numpy.nan]*4,
    'measurement3': list(numpy.random.uniform(0, 1, 4)) + [numpy.nan]*4,
})
# df2: measurement1, measurement4 and measurement5 for four repeats.
df2 = pandas.DataFrame({
    'protein': ['A']*2 + ['B']*2,
    'repeat': range(1, 5),
    'measurement1': list(numpy.random.uniform(0, 1, 4)),
    'measurement4': list(numpy.random.uniform(0, 1, 4)),
    'measurement5': list(numpy.random.uniform(0, 1, 4)),
})
# Index both frames by (protein, repeat).
idx = ['protein', 'repeat']
df1.set_index(idx, inplace=True)
df2.set_index(idx, inplace=True)
The first:
>>> df1
                measurement1  measurement2  measurement3
protein repeat
A       1                NaN      0.757366      0.858163
        2                NaN      0.453202      0.287777
        3                NaN      0.434762      0.044638
        4                NaN      0.825710      0.653887
B       5           0.732218           NaN           NaN
        6           0.380481           NaN           NaN
        7           0.444811           NaN           NaN
        8           0.569743           NaN           NaN
The second:
>>> df2
                measurement1  measurement4  measurement5
protein repeat
A       1           0.342011      0.174242      0.071223
        2           0.416247      0.820345      0.048176
B       3           0.240464      0.767659      0.328830
        4           0.985637      0.459141      0.089130
How can I merge these DataFrames so that I get something like this:
                measurement1  measurement2  measurement3  measurement4  measurement5
protein repeat
A       1           0.721179      0.019207      0.189169      0.186984      0.316553
        2           0.425959      0.301376      0.677409      0.794600      0.668739
        3           0.675156      0.834304      0.022280      0.414653      0.263979
        4           0.667983      0.563201      0.841316      0.062459      0.584332
B       5           0.598407           NaN           NaN           NaN           NaN
        6           0.658570           NaN           NaN           NaN           NaN
        7           0.226620           NaN           NaN           NaN           NaN
        8           0.958272           NaN           NaN           NaN           NaN
Answer 0 (score: 4)
It seems df2 is wrong; it should contain only the A level:
import numpy as np
import pandas as pd

# df2 rebuilt so that all four rows belong to protein A (repeats 1 to 4).
df2 = pd.DataFrame({
    'protein': ['A']*4,
    'repeat': range(1, 5),
    'measurement1': list(np.random.uniform(0, 1, 4)),
    'measurement4': list(np.random.uniform(0, 1, 4)),
    'measurement5': list(np.random.uniform(0, 1, 4)),
})
idx = ['protein', 'repeat']
df2.set_index(idx, inplace=True)
print(df2)
                measurement1  measurement4  measurement5
protein repeat
A       1           0.927584      0.741862      0.165938
        2           0.569004      0.048579      0.780998
        3           0.457412      0.708697      0.286537
        4           0.753526      0.839243      0.306470
So you can use:
# df2's values take priority; cells missing from df2 are filled from df1.
df = df2.combine_first(df1).reset_index()
# Move the former index columns (protein, repeat) to the end.
df = df[df.columns[2:].tolist() + df.columns[:2].tolist()]
print(df)
   measurement1  measurement2  measurement3  measurement4  measurement5  \
0      0.539505      0.241686      0.894978      0.988329      0.963004
1      0.626309      0.095530      0.043223      0.375186      0.341831
2      0.005545      0.238250      0.301947      0.097038      0.798923
3      0.484909      0.807791      0.980582      0.461909      0.798846
4      0.463653           NaN           NaN           NaN           NaN
5      0.502216           NaN           NaN           NaN           NaN
6      0.313669           NaN           NaN           NaN           NaN
7      0.047340           NaN           NaN           NaN           NaN

  protein  repeat
0       A       1
1       A       2
2       A       3
3       A       4
4       B       5
5       B       6
6       B       7
7       B       8
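To make the priority rule of combine_first easier to see, here is a minimal sketch that uses small made-up values instead of random ones. The two frames are hypothetical stand-ins for df1 and df2, not the data from the question.

```python
import numpy as np
import pandas as pd

idx = ['protein', 'repeat']

# Hypothetical stand-in for df1: measurement1 is missing for protein A.
df1 = pd.DataFrame({
    'protein': ['A', 'A', 'B', 'B'],
    'repeat': [1, 2, 3, 4],
    'measurement1': [np.nan, np.nan, 0.3, 0.4],
    'measurement2': [0.5, 0.6, np.nan, np.nan],
}).set_index(idx)

# Hypothetical stand-in for df2: only protein A rows.
df2 = pd.DataFrame({
    'protein': ['A', 'A'],
    'repeat': [1, 2],
    'measurement1': [0.1, 0.2],
    'measurement4': [0.7, 0.8],
}).set_index(idx)

# The caller (df2) wins where it has a value; everything else comes from df1.
# The result covers the union of both indexes and both column sets.
combined = df2.combine_first(df1)
print(combined)
```

The protein A rows take measurement1 from df2, the protein B rows keep the values from df1, and measurement4 stays NaN for protein B because neither frame has it there.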
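Answer 1 (score: 1)
A more general solution is to use pandas.merge and then apply fillna between the two measurement1 columns. It is not as clean as jezrael's answer, though, which in some cases can use combine_first.
Note that I changed the second DataFrame's index in the same way jezrael did.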
import pandas as pd

# Left join on the shared index; df2's overlapping column gets the '_2' suffix.
df_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='left', suffixes=('', '_2'))
# Fill gaps in df1's measurement1 from df2's copy (assigned back, not a chained inplace call).
df_merge['measurement1'] = df_merge['measurement1'].fillna(df_merge['measurement1_2'])
df_merge = df_merge.drop(columns='measurement1_2')
print(df_merge)
                measurement1  measurement2  measurement3  measurement4  \
protein repeat
A       1           0.947668      0.361499      0.679650      0.001189
        2           0.335468      0.155245      0.651453      0.217520
        3           0.249411      0.364105      0.395564      0.523953
        4           0.550545      0.889828      0.592233      0.973457
B       5           0.655718           NaN           NaN           NaN
        6           0.052645           NaN           NaN           NaN
        7           0.013689           NaN           NaN           NaN
        8           0.640769           NaN           NaN           NaN

                measurement5
protein repeat
A       1           0.841053
        2           0.291956
        3           0.097706
        4           0.573144
B       5                NaN
        6                NaN
        7                NaN
        8                NaN
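One difference between the two approaches is worth checking. A left merge followed by fillna keeps df1's value when both frames have one, whereas df2.combine_first(df1) keeps df2's. With the question's data the two agree, because df1's measurement1 is NaN wherever df2 has a row. The sketch below uses made-up values with a deliberate overlap at ('A', 2) to show where they diverge; the frames are hypothetical, not the original data.

```python
import numpy as np
import pandas as pd

idx = ['protein', 'repeat']

# Hypothetical data: both frames hold a measurement1 value for ('A', 2).
df1 = pd.DataFrame({
    'protein': ['A', 'A', 'B'],
    'repeat': [1, 2, 3],
    'measurement1': [np.nan, 0.9, 0.3],
}).set_index(idx)

df2 = pd.DataFrame({
    'protein': ['A', 'A'],
    'repeat': [1, 2],
    'measurement1': [0.1, 0.2],
    'measurement4': [0.7, 0.8],
}).set_index(idx)

# Route 1: left merge, then fill gaps in df1's column from df2's column.
merged = pd.merge(df1, df2, left_index=True, right_index=True,
                  how='left', suffixes=('', '_2'))
merged['measurement1'] = merged['measurement1'].fillna(merged['measurement1_2'])
merged = merged.drop(columns='measurement1_2')

# Route 2: combine_first with df2 as the caller.
combined = df2.combine_first(df1)

print(merged.loc[('A', 2), 'measurement1'])    # df1's value is kept
print(combined.loc[('A', 2), 'measurement1'])  # df2's value is kept
```

If the overlap should resolve in df2's favour with the merge route, swapping the two columns in the fillna call is one way to get there.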