Conditional calculation in pandas

Time: 2020-09-03 03:24:05

Tags: python pandas

I do the following calculation in SQL:

SELECT 
    round(sum(total_size) / (1024*1024), 2) 'Total (PB)' ,
    round(sum(keep_size) / (1024*1024), 2) 'Keep (PB)' ,
    round(sum(remove_size) / (1024*1024), 2) 'Remove (PB)' 
FROM (
  SELECT
    case when dedupe_status='K' then path when dedupe_status='R' then null when dedupe_status='G' then group_super end as g_key,
    round(sum(file_size), 2) total_size,
    case when dedupe_status='R' then round(sum(file_size), 2) when dedupe_status='K' then 0 when dedupe_status='G' then round(sum(file_size) - file_size, 2) end remove_size,
    case when dedupe_status='R' then 0 when dedupe_status='K' then file_size when dedupe_status='G' then round(sum(file_size) - (sum(file_size) - file_size), 2) end keep_size
    from dedupe__df group by g_key
  ) clean_list

I include this only as a reference. I want to do the same calculation in pandas, and this is the DataFrame I have:

import pandas as pd

df = pd.DataFrame([
    {'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
    {'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
    {'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])

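I want a result with three values: Total Size, Remove Size, and Keep Size. Here is how they are calculated:

  • Total: easy, just sum all the sizes.
  • Keep: if the status is K (Keep), sum the size. If the status is R (Remove), skip it. If the status is G (Group), group by dedupe_key and keep only one size per group (it doesn't matter which one; you can grab the first or the min if that's easiest). In other words, when the value is G, all the elements in that group are duplicates and we only need to keep one of them.
  • Remove: Total - Keep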

Using the values above, the expected result is the following. So far I only have the Total:

field           value            # comments
Total           2551             # df['size'].sum()
Keep            607              # 101 (K) + 101 (G: x09) + 405(G: xx01)
Remove          1944             # 134 (R) + 4 (G: x09) + 1405+401 (G: xx01)

How would I do the rest in pandas?

2 answers:

Answer 0: (score: 1)

Let's try np.where to label each row as Keep or Remove:

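import numpy as np

# Label a row 'Remove' if its status is 'R', or if it is a 'G' row that
# repeats an earlier (dedupe_status, dedupe_key) pair; otherwise 'Keep'.
mask = np.where(df.dedupe_status.eq('R') |
                (df.duplicated(['dedupe_status', 'dedupe_key']) &
                 df.dedupe_status.eq('G')
                ),
                'Remove', 'Keep')

# Sum the sizes per label, then append the grand total.
ret = df.groupby(mask)['size'].sum()
ret.loc['Total'] = ret.sum()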

Output:

Keep       607
Remove    1944
Total     2551
Name: size, dtype: int64
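
To see why this works, it may help to look at the intermediate flags. The sketch below rebuilds the question's DataFrame and splits the condition into two named steps; the names is_repeat and is_remove are introduced here for illustration only and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame([
    {'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
    {'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
    {'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])

# duplicated() flags every row after the first occurrence of each
# (dedupe_status, dedupe_key) pair (hypothetical helper name).
is_repeat = df.duplicated(['dedupe_status', 'dedupe_key'])

# A row is removed if it is 'R', or if it is a repeated row of a 'G' group.
is_remove = df['dedupe_status'].eq('R') | (is_repeat & df['dedupe_status'].eq('G'))

print(df.assign(is_repeat=is_repeat, is_remove=is_remove))

keep_size = df.loc[~is_remove, 'size'].sum()    # 101 + 101 + 405 = 607
remove_size = df.loc[is_remove, 'size'].sum()   # 134 + 4 + 1405 + 401 = 1944
print(keep_size, remove_size)
```

Because duplicated() keeps the first occurrence by default, this approach keeps the first row of each G group, which matches the 607 in the question's expected result.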

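Answer 1: (score: 1)

You can create a new df1 that summarizes the information as you suggested. Finally, you can use iloc to select only the first row and the last three columns, then use .T to transpose the result: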

df1 = (df.assign(Total=df['size'].sum())
         .assign(Keep=df[df['dedupe_status'] == 'K']['size'].sum()
                 + df[df['dedupe_status'] == 'G'].groupby('dedupe_key')['size'].min().sum()))
df1 = df1.assign(Remove=df1['Total'] - df1['Keep']).iloc[0,-3:].T
df1

Total     2551
Keep       506 #101 + 4 + 401
Remove    2045
Name: 0, dtype: object
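
The two answers give different Keep values (607 versus 506) because one keeps the first row of each G group and the other keeps the smallest. The question allows either. As a rough sketch, the comparison below computes both side by side; the helper summarize and the variable names are assumptions made for this illustration, not taken from either answer.

```python
import pandas as pd

df = pd.DataFrame([
    {'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
    {'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
    {'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])

total = df['size'].sum()
k_size = df.loc[df['dedupe_status'] == 'K', 'size'].sum()
# Sizes of the 'G' rows, grouped by their dedupe_key.
g_sizes = df.loc[df['dedupe_status'] == 'G'].groupby('dedupe_key')['size']


def summarize(kept_per_group):
    """Hypothetical helper: build Total/Keep/Remove from one kept size per group."""
    keep = k_size + kept_per_group.sum()
    return pd.Series({'Total': total, 'Keep': keep, 'Remove': total - keep})


result = pd.DataFrame({
    'first': summarize(g_sizes.first()),  # keep the first row of each group
    'min': summarize(g_sizes.min()),      # keep the smallest row of each group
})
print(result)
```

With 'first', Keep is 607 and Remove is 1944; with 'min', Keep is 506 and Remove is 2045. Total is 2551 in both cases.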