I have a table like the following.
msno date num_25 num_50 num_75 num_985 num_100 num_unq
1 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20151201 3 3 2 0 8 11
2 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20160628 0 0 1 1 1 3
3 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20170106 2 1 0 0 35 34
4 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20150803 0 0 0 0 16 11
5 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20160527 4 3 0 2 2 11
6 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20160808 14 3 4 1 15 31
I want to group the rows by msno, sum the columns num_25 through num_unq, and also find the earliest and latest date within each msno.
df = df_user_logs_v2.drop('date', axis=1).groupby('msno', as_index=False).sum()
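The code above sums all the values, but it only works if I drop the date column first. I would like to keep the minimum and maximum date for each msno, along with the number of rows.
Expected output for the first msno: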
msno num_25_sum num_50_sum num_75_sum num_985_sum num_100_sum num_unq_sum date_earliest date_latest count
1 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 5 4 3 1 44 48 20151201 20170106 3
Answer 0 (score: 0)
Let's try this:
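# Sum every column from the third one onward (num_25 ... num_unq)
d = dict((i, 'sum') for i in df.columns[2:])
# Keep both the earliest and the latest date per msno
d['date'] = ['min', 'max']
# Count the rows in each group
d['msno'] = 'count'
df_out = df.groupby('msno').agg(d)
# Flatten the two-level column index, e.g. ('date', 'min') -> 'date_min'
df_out.columns = df_out.columns.map('_'.join)
df_out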
Output:
msno_count date_min date_max \
msno
KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 3 20150803 20160808
PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 3 20151201 20170106
num_75_sum num_50_sum \
msno
KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 4 6
PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 3 4
num_985_sum num_25_sum \
msno
KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 3 18
PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 1 5
num_100_sum num_unq_sum
msno
KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 33 53
PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 44 48
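
If the exact column names from the question's expected output are wanted (date_earliest, date_latest, count), one possible alternative is pandas named aggregation, where each output column is given as a (source column, function) pair. The following is only a sketch, assuming pandas 0.25 or later; the sample frame is rebuilt from the six rows in the question, and the names num_cols and aggs are introduced here for illustration.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "msno": ["PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g="] * 3
            + ["KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8="] * 3,
    "date": [20151201, 20160628, 20170106, 20150803, 20160527, 20160808],
    "num_25": [3, 0, 2, 0, 4, 14],
    "num_50": [3, 0, 1, 0, 3, 3],
    "num_75": [2, 1, 0, 0, 0, 4],
    "num_985": [0, 1, 0, 0, 2, 1],
    "num_100": [8, 1, 35, 16, 2, 15],
    "num_unq": [11, 3, 34, 11, 11, 31],
})

# Hypothetical helper list: every column whose name starts with "num_"
num_cols = [c for c in df.columns if c.startswith("num_")]

# One (source column, function) pair per output column
aggs = {f"{c}_sum": (c, "sum") for c in num_cols}
aggs["date_earliest"] = ("date", "min")
aggs["date_latest"] = ("date", "max")
aggs["count"] = ("date", "size")  # number of rows per msno

# as_index=False keeps msno as a regular column instead of the index
df_out = df.groupby("msno", as_index=False).agg(**aggs)
print(df_out)
```

Because the output names are chosen up front, no flattening of a two-level column index is needed afterwards, and the column order follows the order in which the pairs were added.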