How do I find the sum of counts grouped by ID?

Time: 2017-06-09 12:30:57

Tags: pandas

I would like to get the output described below. My data looks like this:

id      start        end       count     Time      Train
001     Paris      London        01      05:00      Yes
001     Paris      London        01      05:00      Yes
002     Prague     Vienna        15      15:00      No
003     Frankfurt  London        01      17:00      Yes
015     Paris      London        08      21:00      No
019     Barcelona  Vienna        15      15:00      No
003     Frankfurt  London        01      07:00      Yes
002     Prague     Vienna        15      05:00      No

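I want to find the sum of count grouped by id, ignoring rows that share the same id, start and end. I also have 4 GB of data, and I want to find the top 5 start and end cities. Thanks.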

The output I would like to get looks something like this:

 Prague -> Vienna     Count: 15
 Barcelona -> Vienna  Count: 15
 Paris -> London      Count: 09
 Frankfurt -> London  Count: 02
.....

2 answers:

Answer 0: (score: 0)

You can use drop_duplicates followed by groupby, aggregating with sum:

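# Convert count to integers so that sum() adds numbers (values like 01 may have been read as strings)
df['count'] = df['count'].astype(int)
# Keep only the first row for each (id, start, end) combination
df = df.drop_duplicates(['id','start','end'])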
print (df)
    id      start     end  count   Time Train
0  001      Paris  London      1  05:00   Yes
2  002     Prague  Vienna     15  15:00    No
3  003  Frankfurt  London      1  17:00   Yes
4  015      Paris  London      8  21:00    No
5  019  Barcelona  Vienna     15  15:00    No
df1 = df.groupby('id', as_index=False)['count'].sum()
print (df1)
    id  count
0  001      1
1  002     15
2  003      1
3  015      8
4  019     15

df11 = df.groupby(['id', 'start', 'end'], as_index=False)['count'].sum()
print (df11)
    id      start     end  count
0  001      Paris  London      1
1  002     Prague  Vienna     15
2  003  Frankfurt  London      1
3  015      Paris  London      8
4  019  Barcelona  Vienna     15

df12 = df.groupby(['start', 'end'], as_index=False)['count'].sum()
print (df12)
       start     end  count
0  Barcelona  Vienna     15
1  Frankfurt  London      1
2      Paris  London      9
3     Prague  Vienna     15
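
Since the question mentions 4 GB of data, the same drop_duplicates + groupby steps can also be applied chunk by chunk. The following is only a sketch under assumptions not stated in the question: the data is taken to be a CSV file with the column names shown above, and the in-memory `csv_source` is a hypothetical stand-in for the real file path. It also assumes that the de-duplicated rows fit in memory.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large file: a small in-memory CSV.
# With real data, csv_source would be the path to the file instead.
csv_source = io.StringIO(
    "id,start,end,count\n"
    "001,Paris,London,1\n"
    "001,Paris,London,1\n"
    "002,Prague,Vienna,15\n"
    "003,Frankfurt,London,1\n"
    "015,Paris,London,8\n"
    "019,Barcelona,Vienna,15\n"
    "003,Frankfurt,London,1\n"
    "002,Prague,Vienna,15\n"
)

keys = ["id", "start", "end"]
reduced_chunks = []

# chunksize makes read_csv yield DataFrames of at most that many rows;
# dtype keeps ids such as "001" as strings, usecols skips unused columns.
reader = pd.read_csv(csv_source, dtype={"id": str},
                     usecols=keys + ["count"], chunksize=3)
for chunk in reader:
    # Dropping duplicates inside each chunk shrinks what is kept in memory.
    reduced_chunks.append(chunk.drop_duplicates(keys))

# Duplicates can straddle chunk boundaries, so de-duplicate once more.
unique_trips = pd.concat(reduced_chunks, ignore_index=True).drop_duplicates(keys)
route_totals = unique_trips.groupby(["start", "end"], as_index=False)["count"].sum()
print(route_totals)
```

The chunk size of 3 is only there to exercise the chunk boundaries on the sample; a real run would use a much larger value.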

For the top values, use nlargest (applied here to the de-duplicated rows in df):

df2 = df.nlargest(5, 'count')[['start','end']]
print (df2)
       start     end
2     Prague  Vienna
5  Barcelona  Vienna
4      Paris  London
0      Paris  London
3  Frankfurt  London
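
If the goal is the top start/end pairs by summed count rather than the top individual rows, nlargest can be applied to the grouped result instead. This is a minimal, self-contained sketch that rebuilds the sample rows from the question; the variable names `unique_trips`, `route_totals` and `top_routes` are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question; ids are strings to keep leading zeros.
rows = [
    ("001", "Paris", "London", 1, "05:00", "Yes"),
    ("001", "Paris", "London", 1, "05:00", "Yes"),
    ("002", "Prague", "Vienna", 15, "15:00", "No"),
    ("003", "Frankfurt", "London", 1, "17:00", "Yes"),
    ("015", "Paris", "London", 8, "21:00", "No"),
    ("019", "Barcelona", "Vienna", 15, "15:00", "No"),
    ("003", "Frankfurt", "London", 1, "07:00", "Yes"),
    ("002", "Prague", "Vienna", 15, "05:00", "No"),
]
df = pd.DataFrame(rows, columns=["id", "start", "end", "count", "Time", "Train"])

# Keep one row per (id, start, end), then sum count per route.
unique_trips = df.drop_duplicates(["id", "start", "end"])
route_totals = unique_trips.groupby(["start", "end"], as_index=False)["count"].sum()

# Rank the routes by their summed count and keep at most five.
top_routes = route_totals.nlargest(5, "count")

for start, end, total in zip(top_routes["start"], top_routes["end"], top_routes["count"]):
    print(f"{start} -> {end}  Count: {total}")
```

With the duplicate rule applied, Frankfurt -> London sums to 1 rather than the 02 shown in the question's sample output, because both of its rows share the same id, start and end.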

Answer 1: (score: -1)

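-- Keep one row per (id, start, end), then sum count per route and return the top 5.
-- MIN(`count`) assumes duplicate rows carry the same count.
-- `end` and `count` are quoted with backticks (MySQL style, matching LIMIT 0,5).
SELECT T.start, T.`end`, SUM(T.`count`) AS total FROM
(
    SELECT id, start, `end`, MIN(`count`) AS `count` FROM TABLE1 GROUP BY id, start, `end`
) T
GROUP BY T.start, T.`end` ORDER BY total DESC LIMIT 0,5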
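
For anyone who wants to try a SQL approach without setting up a database server, the de-duplicate-then-sum idea can be sketched from Python with an in-memory SQLite database. This is an illustration under assumptions: the table name TABLE1 follows this answer, the column types are guesses, MIN("count") assumes that duplicate rows carry the same count, and the secondary sort on start is added only to make tie order deterministic.

```python
import sqlite3

# Sample rows from the question, reduced to the columns the query needs.
rows = [
    ("001", "Paris", "London", 1),
    ("001", "Paris", "London", 1),
    ("002", "Prague", "Vienna", 15),
    ("003", "Frankfurt", "London", 1),
    ("015", "Paris", "London", 8),
    ("019", "Barcelona", "Vienna", 15),
    ("003", "Frankfurt", "London", 1),
    ("002", "Prague", "Vienna", 15),
]

conn = sqlite3.connect(":memory:")
# Column names are double-quoted because "end" is an SQL keyword.
conn.execute('CREATE TABLE TABLE1 (id TEXT, "start" TEXT, "end" TEXT, "count" INTEGER)')
conn.executemany("INSERT INTO TABLE1 VALUES (?, ?, ?, ?)", rows)

query = """
SELECT "start", "end", SUM("count") AS total
FROM (
    -- one row per (id, start, end)
    SELECT id, "start", "end", MIN("count") AS "count"
    FROM TABLE1
    GROUP BY id, "start", "end"
) AS unique_trips
GROUP BY "start", "end"
ORDER BY total DESC, "start"
LIMIT 5
"""
top_routes = conn.execute(query).fetchall()

for start, end, total in top_routes:
    print(f"{start} -> {end}  Count: {total}")
conn.close()
```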