Handling rows of a DataFrame that also exist in another DataFrame in pandas

Asked: 2019-02-06 12:16:33

Tags: python pandas dataframe

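I have a DataFrame that contains all of my training, validation, and test data, and a second DataFrame that contains only my test data. Each data point is identified by 'data_index'.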

import pandas as pd

# All data points, with a placeholder string in 'split'
df_all = pd.DataFrame({'data_index': range(7), 'split': 'NA'})
df_all.set_index('data_index', inplace=True)

# Only the test data points
df_test = pd.DataFrame({'data_index': [3, 5], 'split': 'test'})
df_test.set_index('data_index', inplace=True)



           split
data_index      
0             NA
1             NA
2             NA
3             NA
4             NA
5             NA
6             NA

           split
data_index      
3           test
5           test

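How can I fill in the values of the 'split' column in the first DataFrame based on the test DataFrame, so that I get something like this: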

                split
data_index           
0           train/val
1           train/val
2           train/val
3                test
4           train/val
5                test
6           train/val

2 answers:

Answer 0: (score: 2)

Use Index.map together with fillna:

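# Look up each index label in df_test['split']; labels not found become missing values
df_all['split'] = df_all.index.map(df_test['split'].get)
# Every row that is not in df_test belongs to train/val
df_all['split'] = df_all['split'].fillna('train/val')
print(df_all)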
                split
data_index           
0           train/val
1           train/val
2           train/val
3                test
4           train/val
5                test
6           train/val
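
To make the two steps easier to follow, here is a minimal sketch that keeps the intermediate result in its own variable. The name looked_up is an assumption introduced only for illustration; the data is the same as in the question.

```python
import pandas as pd

df_all = pd.DataFrame({'data_index': range(7), 'split': 'NA'}).set_index('data_index')
df_test = pd.DataFrame({'data_index': [3, 5], 'split': 'test'}).set_index('data_index')

# Step 1: look up every label of df_all in df_test['split'].
# Series.get returns None for labels that are not present in df_test.
# 'looked_up' is a hypothetical helper name, not part of the original answer.
looked_up = pd.Series(df_all.index.map(df_test['split'].get), index=df_all.index)

# Step 2: the rows without a match are the train/val rows.
df_all['split'] = looked_up.fillna('train/val')
print(df_all)
```

Inspecting looked_up.isna() before the fillna call shows which rows were not found in df_test.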

If the placeholder values are real missing values (NaN) rather than the string 'NA', use combine_first:

import numpy as np
import pandas as pd

# Missing values are defined as np.nan here, not as the string 'NA'
df_all = pd.DataFrame({'data_index': range(7), 'split': np.nan})
df_all.set_index('data_index', inplace=True)

df_test = pd.DataFrame({'data_index': [3, 5], 'split': 'test'})
df_test.set_index('data_index', inplace=True)

# Fill the NaN entries from df_test, then label the remaining rows
df_all['split'] = df_all['split'].combine_first(df_test['split']).fillna('train/val')
print(df_all)
                split
data_index           
0           train/val
1           train/val
2           train/val
3                test
4           train/val
5                test
6           train/val
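
The reason the placeholder has to be NaN is that combine_first only takes values from the other Series where the calling Series is null. The following sketch illustrates the difference under that assumption; the variable names (test_split, as_string, as_nan, kept, filled) are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical helper Series mirroring df_test['split'] from the question
test_split = pd.Series('test', index=pd.Index([3, 5], name='data_index'))

# Placeholder stored as the string 'NA': it counts as a regular value,
# so combine_first leaves every entry untouched.
as_string = pd.Series('NA', index=pd.RangeIndex(7, name='data_index'))
kept = as_string.combine_first(test_split)

# Placeholder stored as np.nan: the gaps at labels 3 and 5 are filled
# from test_split, the other labels stay NaN.
as_nan = pd.Series(np.nan, index=pd.RangeIndex(7, name='data_index'), dtype=object)
filled = as_nan.combine_first(test_split)

print(kept)
print(filled)
```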

Answer 1: (score: 1)

Besides Index.map as described above, the problem can also be solved with a few more basic operations:

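df = pd.merge(df_all, df_test, how='left', on='data_index')  # left join on the index level name
df.drop(['split_x'], axis=1, inplace=True)                   # drop the placeholder column from df_all
df = df.rename(columns={'split_y': 'split'})                 # keep the column that came from df_test
df.loc[df['split'] != 'test', 'split'] = 'train/val'         # everything that is not test is train/val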

The result after each statement is:

          split_x split_y
data_index                
0               NA     NaN
1               NA     NaN
2               NA     NaN
3               NA    test
4               NA     NaN
5               NA    test
6               NA     NaN

           split_y
data_index        
0              NaN
1              NaN
2              NaN
3             test
4              NaN
5             test
6              NaN

           split
data_index      
0            NaN
1            NaN
2            NaN
3           test
4            NaN
5           test
6            NaN

                split
data_index           
0           train/val
1           train/val
2           train/val
3                test
4           train/val
5                test
6           train/val
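
As a shorter variant of the merge steps above, here is a minimal sketch that swaps in DataFrame.join for pd.merge. It is not part of the original answer and assumes the placeholder column in df_all carries no information, so it can be dropped before joining. That avoids the split_x / split_y suffixes altogether.

```python
import pandas as pd

df_all = pd.DataFrame({'data_index': range(7), 'split': 'NA'}).set_index('data_index')
df_test = pd.DataFrame({'data_index': [3, 5], 'split': 'test'}).set_index('data_index')

# Drop the placeholder column first, then left-join on the shared index.
# Rows of df_all without a match in df_test get NaN in 'split'.
df = df_all.drop(columns='split').join(df_test, how='left')

# Label the unmatched rows.
df['split'] = df['split'].fillna('train/val')
print(df)
```

Whether this is preferable depends on the real data: if df_all has other columns worth keeping, they pass through the join unchanged.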