df_base =
   time_id object_id gt_class hp_class
0        1         a      CAR       ""
1        1         b      CAR       ""
2        2         c   PERSON   PERSON
3        2         d   PERSON   PERSON
4        2         e      CAR       ""
df_feature =
   time_id object_id gt_class hp_class
0        1         a      CAR      CAR
1        1         b      CAR      CAR
2        2         c   PERSON       ""
3        2         d   PERSON       ""
4        2         e      CAR       ""
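In each data frame, a row represents one object (object_id) at a given time step (time_id), with its ground-truth class gt_class and the corresponding hypothesis class hp_class.
If a ground truth is missed, the corresponding hp_class is "".
I need to compare df_base and df_feature per time_id and gt_class, and produce the following data frame: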
compare_df =
time_id  gt_class  num_missed_base  num_missed_feature
      1     "CAR"                2                   0
      1  "PERSON"                0                   0
      2  "PERSON"                0                   2
      2     "CAR"                1                   1
For example, for time_id == 1 and gt_class == "CAR" above, there are two missed objects in df_base and none in df_feature.
But I don't know how to do this. Any help is appreciated.
Answer 0 (score: 2)
Data:
import pandas as pd

df_base = pd.DataFrame.from_dict({'time_id':[1,1,2,2,2], 'object_id':['a','b','c','d','e'], 'gt_class':['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
'hp_class':['','','PERSON','PERSON','']})
df_feature = pd.DataFrame.from_dict({'time_id':[1,1,2,2,2], 'object_id':['a','b','c','d','e'], 'gt_class':['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
'hp_class':['CAR','CAR','','','']})
Add a flag column, where 1 marks a missed object:
df_feature['flag'] = df_feature.hp_class.apply(lambda x: 1 if x=='' else 0)
df_base['flag'] = df_base.hp_class.apply(lambda x: 1 if x=='' else 0)
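As a possible alternative, the same flag can be computed without apply by comparing the column to the empty string directly. This is only a sketch, and it assumes a miss is always encoded as exactly '' (not NaN or whitespace), as in the question's data:

```python
import pandas as pd

df_base = pd.DataFrame({
    'time_id': [1, 1, 2, 2, 2],
    'object_id': ['a', 'b', 'c', 'd', 'e'],
    'gt_class': ['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
    'hp_class': ['', '', 'PERSON', 'PERSON', ''],
})

# Vectorized equivalent of the apply/lambda version:
# True where hp_class is the empty string, then cast to 0/1.
df_base['flag'] = df_base['hp_class'].eq('').astype(int)
print(df_base)
```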
Group by time_id and gt_class and sum the missed objects:
df1 = df_base.groupby(['time_id', 'gt_class'])['flag'].agg(num_missed_base='sum')
df2 = df_feature.groupby(['time_id', 'gt_class'])['flag'].agg(num_missed_feature='sum')
df = pd.concat([df1, df2], axis=1)
print(df)
                  num_missed_base  num_missed_feature
time_id gt_class
1       CAR                     2                   0
2       CAR                     1                   1
        PERSON                  0                   2
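Note that the grouped result has no row for time_id 1 / PERSON, because no such rows exist in the data, whereas the compare_df in the question lists that pair with zeros. A minimal sketch of one way to get that layout follows, assuming every combination of observed time_id and gt_class should appear; the helper count_missed and the name full_index are made up for illustration:

```python
import pandas as pd

common = {
    'time_id': [1, 1, 2, 2, 2],
    'object_id': ['a', 'b', 'c', 'd', 'e'],
    'gt_class': ['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
}
df_base = pd.DataFrame({**common, 'hp_class': ['', '', 'PERSON', 'PERSON', '']})
df_feature = pd.DataFrame({**common, 'hp_class': ['CAR', 'CAR', '', '', '']})

def count_missed(df, name):
    # A miss is a row whose hp_class is the empty string.
    missed = df['hp_class'].eq('')
    return missed.groupby([df['time_id'], df['gt_class']]).sum().rename(name)

counts = pd.concat(
    [count_missed(df_base, 'num_missed_base'),
     count_missed(df_feature, 'num_missed_feature')],
    axis=1,
)

# Every (time_id, gt_class) pair, including pairs with no rows at all.
full_index = pd.MultiIndex.from_product(
    [df_base['time_id'].unique(), df_base['gt_class'].unique()],
    names=['time_id', 'gt_class'],
)
compare_df = counts.reindex(full_index, fill_value=0).reset_index()
print(compare_df)
```

Pairs that never occur are filled with 0 here; whether that is the right reading of "no objects of that class" depends on the use case.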
Answer 1 (score: 0)
Using the pd.DataFrame.compare method (available since pandas 1.1):
df_comp = df_base.compare(df_feature, keep_shape=True, keep_equal=True)
df_out = (df_comp['hp_class']=='').groupby([df_base['time_id'], df_base['gt_class']])\
.sum().rename(columns={'self':'num_missing_base', 'other':'num_missing_feature'})
print(df_out)
Output:
                  num_missing_base  num_missing_feature
time_id gt_class
1       CAR                      2                    0
2       CAR                      1                    1
        PERSON                   0                    2
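One caveat: DataFrame.compare only works on identically labeled frames, so it relies on both frames listing the same objects in the same row order. If that is not guaranteed, a rough sketch of an alternative is to align the rows with a merge on the key columns first. This assumes (time_id, object_id, gt_class) identifies an object in both frames; the reversed df_feature below is only there to illustrate misaligned rows:

```python
import pandas as pd

common = {
    'time_id': [1, 1, 2, 2, 2],
    'object_id': ['a', 'b', 'c', 'd', 'e'],
    'gt_class': ['CAR', 'CAR', 'PERSON', 'PERSON', 'CAR'],
}
df_base = pd.DataFrame({**common, 'hp_class': ['', '', 'PERSON', 'PERSON', '']})
df_feature = pd.DataFrame({**common, 'hp_class': ['CAR', 'CAR', '', '', '']})

# Reverse the row order to simulate frames that are not aligned row by row.
df_feature = df_feature.iloc[::-1].reset_index(drop=True)

# Align rows on the key columns instead of on position.
keys = ['time_id', 'object_id', 'gt_class']
merged = df_base.merge(df_feature, on=keys, suffixes=('_base', '_feature'))

# True where the hypothesis is missing, one column per frame.
missed = merged[['hp_class_base', 'hp_class_feature']].eq('')
missed.columns = ['num_missed_base', 'num_missed_feature']

df_out = missed.groupby([merged['time_id'], merged['gt_class']]).sum()
print(df_out)
```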