I would like to get the output for the following problem. I have data of this form:
id start end count Time Train
001 Paris London 01 05:00 Yes
001 Paris London 01 05:00 Yes
002 Prague Vienna 15 15:00 No
003 Frankfurt London 01 17:00 Yes
015 Paris London 08 21:00 No
019 Barcelona Vienna 15 15:00 No
003 Frankfurt London 01 07:00 Yes
002 Prague Vienna 15 05:00 No
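I want to find the sum of count grouped by id, ignoring rows that share the same id, start and end. I also have 4 GB of data, and I want to find the top 5 start and end cities. Thanks.
The output I would like looks something like this: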
Prague -> Vienna Count : 15
Barcelona -> Vienna count : 15
Paris --> london Count : 09
Frankfurt -> London Count: 02
.....
Answer 0 (score: 0)
You can use drop_duplicates + groupby and aggregate with sum:
df['count'] = df['count'].astype(int)
df = df.drop_duplicates(['id','start','end'])
print (df)
id start end count Time Train
0 001 Paris London 1 05:00 Yes
2 002 Prague Vienna 15 15:00 No
3 003 Frankfurt London 1 17:00 Yes
4 015 Paris London 8 21:00 No
5 019 Barcelona Vienna 15 15:00 No
df1 = df.groupby('id', as_index=False)['count'].sum()
print (df1)
id count
0 001 1
1 002 15
2 003 1
3 015 8
4 019 15
df11 = df.groupby(['id', 'start', 'end'], as_index=False)['count'].sum()
print (df11)
id start end count
0 001 Paris London 1
1 002 Prague Vienna 15
2 003 Frankfurt London 1
3 015 Paris London 8
4 019 Barcelona Vienna 15
df12 = df.groupby(['start', 'end'], as_index=False)['count'].sum()
print (df12)
start end count
0 Barcelona Vienna 15
1 Frankfurt London 1
2 Paris London 9
3 Prague Vienna 15
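The question mentions roughly 4 GB of data. If that does not fit in memory, the same "drop duplicates, then sum per route" logic can be applied chunk by chunk. The following is only a sketch under assumptions: the helper name `route_totals_chunked` is made up for illustration, the input is assumed to be a comma-separated file with the header shown in the question, and the set of seen `(id, start, end)` keys is assumed to fit in memory.

```python
import io
from collections import Counter

import pandas as pd


def route_totals_chunked(source, chunksize=1_000_000):
    """Sum count per (start, end), counting each (id, start, end) only once.

    Hypothetical helper; assumes a comma-separated input with the
    columns id, start, end, count, Time, Train.
    """
    seen = set()        # (id, start, end) keys that were already counted
    totals = Counter()  # (start, end) -> running sum of count

    reader = pd.read_csv(source, dtype={"id": str}, chunksize=chunksize)
    for chunk in reader:
        # Remove duplicates inside the chunk first.
        chunk = chunk.drop_duplicates(["id", "start", "end"])
        # Then remove rows whose key appeared in an earlier chunk.
        keys = list(zip(chunk["id"], chunk["start"], chunk["end"]))
        is_new = [key not in seen for key in keys]
        seen.update(keys)
        fresh = chunk.loc[is_new]
        # Add this chunk's per-route sums to the running totals.
        sums = fresh.groupby(["start", "end"])["count"].sum()
        for (start, end), value in sums.items():
            totals[(start, end)] += int(value)
    return totals


# Small in-memory stand-in for the real file, using the rows from the question.
sample = """id,start,end,count,Time,Train
001,Paris,London,01,05:00,Yes
001,Paris,London,01,05:00,Yes
002,Prague,Vienna,15,15:00,No
003,Frankfurt,London,01,17:00,Yes
015,Paris,London,08,21:00,No
019,Barcelona,Vienna,15,15:00,No
003,Frankfurt,London,01,07:00,Yes
002,Prague,Vienna,15,05:00,No
"""
totals = route_totals_chunked(io.StringIO(sample), chunksize=3)
for (start, end), total in totals.most_common(5):
    print(f"{start} -> {end} Count: {total}")
```

With a real file, a path would be passed instead of the `io.StringIO` object. The tiny `chunksize=3` is there only so that the sample exercises duplicates that span chunk boundaries.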
For the top values, use nlargest:
df2 = df.nlargest(5, 'count')[['start','end']]
print (df2)
start end
2 Prague Vienna
5 Barcelona Vienna
4 Paris London
0 Paris London
3 Frankfurt London
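Note that the nlargest call above ranks individual de-duplicated rows, so a route such as Paris / London can appear more than once. If the goal is the top routes by summed count, as in the output the question asks for, one option is to apply nlargest to the per-route totals instead. A minimal self-contained sketch, with the sample rows copied from the question and the variable names `unique_trips`, `route_totals` and `top_routes` chosen here for illustration:

```python
import pandas as pd

# Sample rows copied from the question, kept as strings like the raw data.
rows = [
    ("001", "Paris", "London", "01", "05:00", "Yes"),
    ("001", "Paris", "London", "01", "05:00", "Yes"),
    ("002", "Prague", "Vienna", "15", "15:00", "No"),
    ("003", "Frankfurt", "London", "01", "17:00", "Yes"),
    ("015", "Paris", "London", "08", "21:00", "No"),
    ("019", "Barcelona", "Vienna", "15", "15:00", "No"),
    ("003", "Frankfurt", "London", "01", "07:00", "Yes"),
    ("002", "Prague", "Vienna", "15", "05:00", "No"),
]
df = pd.DataFrame(rows, columns=["id", "start", "end", "count", "Time", "Train"])
df["count"] = df["count"].astype(int)

# Keep one row per (id, start, end), then total the count per route.
unique_trips = df.drop_duplicates(["id", "start", "end"])
route_totals = unique_trips.groupby(["start", "end"], as_index=False)["count"].sum()

# Top 5 routes by summed count; the sample only has 4 distinct routes.
top_routes = route_totals.nlargest(5, "count")

for start, end, total in zip(top_routes["start"], top_routes["end"], top_routes["count"]):
    print(f"{start} -> {end} Count: {total}")
```

Each route now appears once, and Paris / London carries the combined count of 9 from ids 001 and 015.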
Answer 1 (score: -1)
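-- MySQL-style syntax assumed: keep one row per (id, start, end), then sum count per route
SELECT T.`start`, T.`end`, SUM(T.`count`) AS total FROM
(
SELECT id, `start`, `end`, MIN(`count`) AS `count` FROM TABLE1 GROUP BY id, `start`, `end`
) T
GROUP BY T.`start`, T.`end` ORDER BY total DESC LIMIT 0,5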