I need to run a group by on top of the output of another group by. For example, given table1 below:
id | timestamp | team
----------------------------
1 | 2016-01-02 | A
2 | 2016-02-01 | B
1 | 2016-02-04 | A
1 | 2016-03-05 | A
3 | 2016-05-12 | B
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
6 | 2015-08-01 | C
1 | 2015-04-01 | A
If I run this query:
query = 'select id, max(timestamp) as latest_ts from table1' + \
' where timestamp > "2016-01-01 00:00:00" group by id'
I get:
id | latest_ts |
---------------------
2 | 2016-02-01 |
1 | 2016-03-05 |
3 | 2016-05-15 |
4 | 2016-07-07 |
5 | 2016-08-01 |
However, is it possible to also include the team column, as shown below?
id | latest_ts | team
----------------------------
2 | 2016-02-01 | B
1 | 2016-03-05 | A
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
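One possible way to get this shape is to group by both id and team. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Impala so that it runs on its own, the helper name latest_per_id is made up for the example, and it assumes each id belongs to exactly one team (true in the sample data, not verified for the real table).

```python
import sqlite3

# Sample rows copied from table1 above: (id, timestamp, team).
ROWS = [
    (1, "2016-01-02", "A"), (2, "2016-02-01", "B"), (1, "2016-02-04", "A"),
    (1, "2016-03-05", "A"), (3, "2016-05-12", "B"), (3, "2016-05-15", "B"),
    (4, "2016-07-07", "A"), (5, "2016-08-01", "C"), (6, "2015-08-01", "C"),
    (1, "2015-04-01", "A"),
]


def latest_per_id(rows):
    """Return (id, latest_ts, team) for rows after 2016-01-01 (hypothetical helper)."""
    # In-memory SQLite database: an assumption standing in for the Impala table.
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (id integer, timestamp text, team text)")
    con.executemany("insert into table1 values (?, ?, ?)", rows)
    # Grouping by id AND team lets team appear in the select list.
    query = (
        "select id, max(timestamp) as latest_ts, team from table1 "
        "where timestamp > '2016-01-01 00:00:00' "
        "group by id, team order by latest_ts"
    )
    result = con.execute(query).fetchall()
    con.close()
    return result


latest = latest_per_id(ROWS)
for row in latest:
    print(row)
```

If an id could appear under more than one team, this grouping would return one row per (id, team) pair rather than one row per id, so the assumption matters.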
Ultimately, what I really need to know is how many distinct ids each team had in 2016. My expected result is:
team | count(id)
-------------------
A | 2
B | 2
C | 1
I am trying to run another group by on top of the first group by result with the code below, but it fails with a syntax error.
import pandas as pd
from impala.util import as_pandas  # assumed source of as_pandas (impyla)
query = 'select team, count(id) from ' + \
'(select id, max(timestamp) as latest_ts from table1' + \
' where timestamp > "2016-01-01 00:00:00" group by id)' + \
'group by team'
cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)
df_result
So I would like to know whether this is achievable and, if so, what the right approach would be. Thanks!
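A minimal sketch of one way the nested group by could be written, offered as an illustration and not as a confirmed fix for the Impala error. Two things look suspicious in the query above: the inner select does not return team, so the outer query has nothing to group on, and the subquery has no alias, which Impala is likely to reject. The example uses Python's built-in sqlite3 as a stand-in for Impala so it is runnable as-is; the helper name ids_per_team and the alias latest are invented for the example, and each id is assumed to belong to a single team.

```python
import sqlite3

# Sample rows copied from table1 above: (id, timestamp, team).
ROWS = [
    (1, "2016-01-02", "A"), (2, "2016-02-01", "B"), (1, "2016-02-04", "A"),
    (1, "2016-03-05", "A"), (3, "2016-05-12", "B"), (3, "2016-05-15", "B"),
    (4, "2016-07-07", "A"), (5, "2016-08-01", "C"), (6, "2015-08-01", "C"),
    (1, "2015-04-01", "A"),
]

# Nested form: the inner query also selects team, and the subquery gets an alias.
NESTED_QUERY = (
    "select team, count(id) as n_ids from "
    "(select id, team, max(timestamp) as latest_ts from table1 "
    " where timestamp > '2016-01-01 00:00:00' group by id, team) as latest "
    "group by team order by team"
)

# Flat alternative: count distinct ids per team without a subquery.
FLAT_QUERY = (
    "select team, count(distinct id) as n_ids from table1 "
    "where timestamp > '2016-01-01 00:00:00' "
    "group by team order by team"
)


def ids_per_team(rows, query):
    """Run one of the queries above on an in-memory copy of table1 (hypothetical helper)."""
    con = sqlite3.connect(":memory:")  # stand-in for the Impala connection
    con.execute("create table table1 (id integer, timestamp text, team text)")
    con.executemany("insert into table1 values (?, ?, ?)", rows)
    result = con.execute(query).fetchall()
    con.close()
    return result


nested_counts = ids_per_team(ROWS, NESTED_QUERY)
flat_counts = ids_per_team(ROWS, FLAT_QUERY)
print(nested_counts)
print(flat_counts)
```

On the sample data both queries give the expected counts per team. The flat count(distinct id) form is shorter when latest_ts is not needed in the final result; the nested form is the one to keep if the intermediate per-id rows are also useful.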