How to compare two dataframes to derive a third dataframe

Time: 2020-07-24 17:15:19

Tags: pandas dataframe grouping

df_base =
   time_id  object_id  gt_class  hp_class
0  1        a          CAR       ""
1  1        b          CAR       ""
2  2        c          PERSON    PERSON
3  2        d          PERSON    PERSON
4  2        e          CAR       ""

df_feature =
   time_id  object_id  gt_class  hp_class
0  1        a          CAR       CAR
1  1        b          CAR       CAR
2  2        c          PERSON    ""
3  2        d          PERSON    ""
4  2        e          CAR       ""

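Each dataframe records, for every time_id, an object (object_id) with its ground-truth class gt_class and the corresponding hypothesis class hp_class. If the ground truth was missed, the corresponding hp_class is "".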

I need to compare df_base with df_feature per time_id and produce the following dataframe:

compare_df = 
time_id  gt_class num_missed_base num_missed_feature
1        "CAR"    2               0
1        "PERSON" 0               0
2        "PERSON" 0               2
2        "CAR"    1               1

For example, for time_id == 1 and gt_class == "CAR" above, df_base has two missed objects while df_feature has none.

But I don't know how to do this. Any help is appreciated.

2 answers:

Answer 0 (score: 2)

Data:

import pandas as pd

df_base = pd.DataFrame.from_dict({'time_id':[1,1,2,2,2], 'object_id':['a','b','c','d','e'], 'gt_class':['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'], 
            'hp_class':['','','PERSON','PERSON','']})
df_feature = pd.DataFrame.from_dict({'time_id':[1,1,2,2,2], 'object_id':['a','b','c','d','e'], 'gt_class':['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'], 
            'hp_class':['CAR','CAR','','','']})

Add a flag column, where 1 marks a missed object (an empty hp_class):

# 1 where hp_class is empty (missed), 0 otherwise
df_feature['flag'] = (df_feature['hp_class'] == '').astype(int)
df_base['flag'] = (df_base['hp_class'] == '').astype(int)

Group by time_id and gt_class and sum the flags to count the misses:

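# Count misses per (time_id, gt_class) in each dataframe
df1 = df_base.groupby(['time_id', 'gt_class'])['flag'].agg(num_missed_base='sum')
df2 = df_feature.groupby(['time_id', 'gt_class'])['flag'].agg(num_missed_feature='sum')
# Put the two counts side by side, aligned on the shared (time_id, gt_class) index
df = pd.concat([df1, df2], axis=1)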
print(df)
                  num_missed_base  num_missed_feature
time_id gt_class                                     
1       CAR                     2                   0
2       CAR                     1                   1
        PERSON                  0                   2
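
The result above has no row for time_id 1 / PERSON, which the question's expected compare_df lists with zeros, and it keeps time_id and gt_class in the index. A minimal sketch of one way to get closer to the requested layout, assuming every (time_id, gt_class) combination should appear and that flat columns are wanted; the helper count_missed and the name full_index are illustrative, not part of the answer:

```python
import pandas as pd

df_base = pd.DataFrame({
    'time_id': [1, 1, 2, 2, 2],
    'object_id': ['a', 'b', 'c', 'd', 'e'],
    'gt_class': ['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
    'hp_class': ['', '', 'PERSON', 'PERSON', ''],
})
df_feature = df_base.assign(hp_class=['CAR', 'CAR', '', '', ''])


def count_missed(df, name):
    """Hypothetical helper: count rows with an empty hp_class per (time_id, gt_class)."""
    missed = df['hp_class'].eq('')
    return missed.groupby([df['time_id'], df['gt_class']]).sum().rename(name)


counts = pd.concat(
    [count_missed(df_base, 'num_missed_base'),
     count_missed(df_feature, 'num_missed_feature')],
    axis=1,
)

# Assumption: every time_id x gt_class pair should be listed, with 0 where
# no object of that class exists at that time.
full_index = pd.MultiIndex.from_product(
    [sorted(df_base['time_id'].unique()), sorted(df_base['gt_class'].unique())],
    names=['time_id', 'gt_class'],
)
compare_df = counts.reindex(full_index, fill_value=0).astype(int).reset_index()
print(compare_df)
```

With the sample data this yields four rows, including time_id 1 / PERSON with both counts at 0. Rows come out sorted by time_id and then gt_class, so the order differs slightly from the table in the question.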

Answer 1 (score: 0)

Update: pandas 1.1.0 added the pd.DataFrame.compare method:

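# keep_shape/keep_equal retain every row and column, with 'self' (df_base) and 'other' (df_feature) sub-columns
df_comp = df_base.compare(df_feature, keep_shape=True, keep_equal=True)
# Count empty hp_class values on each side per (time_id, gt_class)
df_out = (df_comp['hp_class']=='').groupby([df_base['time_id'], df_base['gt_class']])\
   .sum().rename(columns={'self':'num_missing_base', 'other':'num_missing_feature'})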
print(df_out)

Output:

                  num_missing_base  num_missing_feature
time_id gt_class                                       
1       CAR                      2                    0
2       CAR                      1                    1
        PERSON                   0                    2
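
Note that DataFrame.compare matches rows by index label, so the approach above relies on row i describing the same object in both dataframes. If that cannot be guaranteed, a merge on the identifying columns is one alternative. A rough sketch, assuming (time_id, object_id, gt_class) uniquely identifies an object in each dataframe; the reordered df_feature below is made up for illustration:

```python
import pandas as pd

df_base = pd.DataFrame({
    'time_id': [1, 1, 2, 2, 2],
    'object_id': ['a', 'b', 'c', 'd', 'e'],
    'gt_class': ['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
    'hp_class': ['', '', 'PERSON', 'PERSON', ''],
})
# Same objects as in the question, deliberately listed in a different row order.
df_feature = pd.DataFrame({
    'time_id': [2, 2, 2, 1, 1],
    'object_id': ['e', 'd', 'c', 'b', 'a'],
    'gt_class': ['CAR', 'PERSON', 'PERSON', 'CAR', 'CAR'],
    'hp_class': ['', '', '', 'CAR', 'CAR'],
})

# Assumption: these three columns identify one object in each dataframe.
keys = ['time_id', 'object_id', 'gt_class']
merged = df_base.merge(
    df_feature,
    on=keys,
    suffixes=('_base', '_feature'),
    validate='one_to_one',  # raises if the keys are duplicated on either side
)

# True where the hypothesis is empty; summing booleans counts the misses.
df_out = (
    merged.assign(
        num_missed_base=merged['hp_class_base'].eq(''),
        num_missed_feature=merged['hp_class_feature'].eq(''),
    )
    .groupby(['time_id', 'gt_class'])[['num_missed_base', 'num_missed_feature']]
    .sum()
)
print(df_out)
```

The default inner merge silently drops objects that appear in only one dataframe; passing how='outer' would keep them, at the cost of having to decide how a missing hp_class should be counted.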