I do this calculation in SQL with the following query:
SELECT
round(sum(total_size) / (1024*1024), 2) 'Total (PB)' ,
round(sum(keep_size) / (1024*1024), 2) 'Keep (PB)' ,
round(sum(remove_size) / (1024*1024), 2) 'Remove (PB)'
FROM (
SELECT
case when dedupe_status='K' then path when dedupe_status='R' then null when dedupe_status='G' then group_super end as g_key,
round(sum(file_size), 2) total_size,
case when dedupe_status='R' then round(sum(file_size), 2) when dedupe_status='K' then 0 when dedupe_status='G' then round(sum(file_size) - file_size, 2) end remove_size,
case when dedupe_status='R' then 0 when dedupe_status='K' then file_size when dedupe_status='G' then round(sum(file_size) - (sum(file_size) - file_size), 2) end keep_size
from dedupe__df group by g_key
) clean_list
I include the SQL only for reference. I want to do the same calculation in pandas. This is the DataFrame I have:
import pandas as pd

df = pd.DataFrame([
{'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
{'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
{'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])
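I want a result with three values: Total Size, Remove Size, and Keep Size. They are calculated as follows:
Total: simple, just add up all the sizes.
Keep: if the status is K (Keep), add the size. If the status is R (Remove), skip it. If the status is G (Group), group by dedupe_key and keep only one size per group (it does not matter which one; take first or min if that is easiest). In other words, a status of G means every element in that group is a duplicate, and we only need to keep one of them.
Remove: Total - Keep.
Using the values above, we would have the following. This is what I have so far: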
field value # comments
Total 2551 # df['size'].sum()
Keep 607 # 101 (K) + 101 (G: x09) + 405 (G: xx01)
Remove 1944 # 134 (R) + 4 (G: x09) + 1405 + 401 (G: xx01)
How would I do the rest in pandas?
Answer 0 (score: 1)
Let's try np.where to label each row as Keep or Remove:
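import numpy as np

# Remove every 'R' row, plus every 'G' row that repeats an earlier dedupe_key
mask = np.where(df.dedupe_status.eq('R') |
                (df.duplicated(['dedupe_status', 'dedupe_key']) &
                 df.dedupe_status.eq('G')
                ),
                'Remove', 'Keep')
# Sum the sizes per label, then append the grand total
ret = df.groupby(mask)['size'].sum()
ret.loc['Total'] = ret.sum()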
Output:
Keep 607
Remove 1944
Total 2551
Name: size, dtype: int64
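The same rule can also be spelled out with explicit boolean masks, which may be easier to adapt if the statuses change. This is only a sketch: the function name summarize_dedupe is made up for illustration, and it assumes that the first row of each G group is the copy to keep.

```python
import pandas as pd

df = pd.DataFrame([
    {'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
    {'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
    {'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])


def summarize_dedupe(frame):
    """Hypothetical helper: return Total, Keep and Remove sizes as a Series."""
    is_keep = frame['dedupe_status'].eq('K')
    is_group = frame['dedupe_status'].eq('G')
    # Assumption: the first row seen for each G dedupe_key is the kept copy
    first_in_group = is_group & ~frame.duplicated(['dedupe_status', 'dedupe_key'])
    total = frame['size'].sum()
    keep = frame.loc[is_keep | first_in_group, 'size'].sum()
    # Remove is defined in the question as Total - Keep
    return pd.Series({'Total': total, 'Keep': keep, 'Remove': total - keep})


result = summarize_dedupe(df)
print(result)
```

Computing Remove as Total - Keep follows the definition in the question directly, so R rows never need to be handled as a separate case.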
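Answer 1 (score: 1)
You can create a new df1 that aggregates the information as you described. Finally, use iloc to select only the first row and the last three columns, then use .T to transpose the result. Note that this keeps the minimum size of each G group rather than the first one, so Keep comes out as 506 instead of 607: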
df1 = (df.assign(Total=df['size'].sum())
.assign(Keep=df[df['dedupe_status'] == 'K']['size'].sum()
+ df[df['dedupe_status'] == 'G'].groupby('dedupe_key')['size'].min().sum()))
df1 = df1.assign(Remove=df1['Total'] - df1['Keep']).iloc[0,-3:].T
df1
Total 2551
Keep 506 #101 + 4 + 401
Remove 2045
Name: 0, dtype: object
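Since the question allows either first or min for the G groups, the two choices can be compared side by side. The sketch below is not part of the answer above; the variable names (kept_k, groups, summary) are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame([
    {'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
    {'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
    {'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
    {'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
    {'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])

total = df['size'].sum()
# Sizes of rows explicitly marked K
kept_k = df.loc[df['dedupe_status'] == 'K', 'size'].sum()
# Sizes of G rows, grouped by their dedupe_key
groups = df.loc[df['dedupe_status'] == 'G'].groupby('dedupe_key')['size']

# Keep one size per G group: either the first row seen or the smallest
keep_first = kept_k + groups.first().sum()
keep_min = kept_k + groups.min().sum()

summary = pd.DataFrame(
    {
        'first': [total, keep_first, total - keep_first],
        'min': [total, keep_min, total - keep_min],
    },
    index=['Total', 'Keep', 'Remove'],
)
print(summary)
```

With the sample data, the first column matches the figures expected in the question (Keep 607, Remove 1944), while the min column matches the output of this answer (Keep 506, Remove 2045).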