Google BigQuery SQL - Get the most recent column value

Time: 2014-08-12 16:25:36

Tags: sql google-bigquery

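I have a Google BigQuery table that contains an email column. Basically, each row shows the state of the user with that email address. I want to query the table so that the result shows only the most recent row for each email address. I have tried all sorts of GROUP BY clauses, JOINing the table against itself, and the usual tricks I would use in MySQL, but I keep getting duplicate emails returned whenever the entire row isn't a match.

Any help is much appreciated!

Sample data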

user_email     | user_first_name | user_last_name | time      | is_deleted
test@test.com  | Joe             | John           | 123456790 |  1
test@test.com  | Joe             | John           | 123456789 |  0
test2@test.com | Jill            | John           | 123456789 |  0

So, given the sample data above, I would want this returned:

user_email     | user_first_name | user_last_name | time      | is_deleted
test@test.com  | Joe             | John           | 123456790 |  1
test2@test.com | Jill            | John           | 123456789 |  0

3 Answers:

Answer 0 (score: 10)

SELECT user_email, user_first_name, user_last_name, time, is_deleted 
FROM (
 SELECT user_email, user_first_name, user_last_name, time, is_deleted
      , RANK() OVER(PARTITION BY user_email ORDER BY time DESC) rank
 FROM table
)
WHERE rank=1
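
As a quick way to check this query without a BigQuery project, here is a minimal sketch that runs the same RANK() pattern against the question's sample data. It assumes an in-memory SQLite database (3.25 or newer, for window functions) as a stand-in for BigQuery; the table name users and the alias rnk are hypothetical names chosen for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_email TEXT, user_first_name TEXT, "
    "user_last_name TEXT, time INTEGER, is_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    [
        ("test@test.com", "Joe", "John", 123456790, 1),
        ("test@test.com", "Joe", "John", 123456789, 0),
        ("test2@test.com", "Jill", "John", 123456789, 0),
    ],
)

# Same shape as the query above: rank rows per email, newest first, keep rank 1.
latest_rows = conn.execute(
    """
    SELECT user_email, user_first_name, user_last_name, time, is_deleted
    FROM (
        SELECT *, RANK() OVER (PARTITION BY user_email ORDER BY time DESC) AS rnk
        FROM users
    )
    WHERE rnk = 1
    ORDER BY user_email
    """
).fetchall()

for row in latest_rows:
    print(row)
```

On this sample every email has a single newest row, so one row per email comes back.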

Answer 1 (score: 2)

Solved!

SELECT l.* FROM [mytable.list] l JOIN (
    SELECT user_email, MAX(time) as time FROM [mytable.list] GROUP EACH BY user_email
) j ON j.user_email = l.user_email WHERE j.time = l.time;
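
One caveat that may be worth checking: joining on MAX(time) keeps every row that shares the maximum time, so two rows with the same email and the same time both survive. The sketch below illustrates this under stated assumptions: SQLite stands in for BigQuery, plain GROUP BY replaces the legacy GROUP EACH BY, and the table name user_list is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_list (user_email TEXT, time INTEGER, is_deleted INTEGER)"
)
# Two rows for test@test.com share the same (maximum) time.
conn.executemany(
    "INSERT INTO user_list VALUES (?, ?, ?)",
    [
        ("test@test.com", 123456789, 1),
        ("test@test.com", 123456789, 0),
        ("test2@test.com", 123456789, 0),
    ],
)

# Join each row to the newest time recorded for its email.
joined_rows = conn.execute(
    """
    SELECT l.user_email, l.time, l.is_deleted
    FROM user_list AS l
    JOIN (
        SELECT user_email, MAX(time) AS time
        FROM user_list
        GROUP BY user_email
    ) AS j ON j.user_email = l.user_email
    WHERE j.time = l.time
    """
).fetchall()

emails = [row[0] for row in joined_rows]
# Both tied rows for test@test.com pass the join condition.
print(emails.count("test@test.com"))
```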

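Answer 2 (score: 0)

In my own work I ran into a potential shortcoming of RANK() compared with the (perhaps newer?) alternative numbering function ROW_NUMBER(); see https://cloud.google.com/bigquery/docs/reference/standard-sql/numbering_functions. On the question's sample data, the RANK() query returns one row per email: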

with minimal_reproducible as (
select 'test@test.com' as user_email, 'Joe' as user_first_name, 'John' as user_last_name, 123456790 as time, 1 is_deleted
union all
select 'test@test.com', 'Joe', 'John', 123456789, 0
union all
select 'test2@test.com', 'Jill', 'John', 123456789, 0
)

select user_email, user_first_name, user_last_name, time, is_deleted from (
    select *, 
    rank() over (partition by user_email order by time desc) as rank
    from minimal_reproducible) inner_table 
where rank = 1

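The accepted answer does give the desired result, except when there is a tie in the ORDER BY clause, in which case duplicate records are returned again: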

with minimal_reproducible as (
select 'test@test.com' as user_email, 'Joe' as user_first_name, 'John' as user_last_name, 123456789 as time, 1 is_deleted
union all
select 'test@test.com', 'Joe', 'John', 123456789, 0
union all
select 'test2@test.com', 'Jill', 'John', 123456789, 0
)

select user_email, user_first_name, user_last_name, time, is_deleted from (
    select *, 
    rank() over (partition by user_email order by time desc) as rank
    from minimal_reproducible) inner_table 
where rank = 1;

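A better solution is therefore to use ROW_NUMBER() instead of RANK(), which guarantees exactly one row per user_email (although which of the tied rows is kept is arbitrary):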

with minimal_reproducible as (
select 'test@test.com' as user_email, 'Joe' as user_first_name, 'John' as user_last_name, 123456789 as time, 1 is_deleted
union all
select 'test@test.com', 'Joe', 'John', 123456789, 0
union all
select 'test2@test.com', 'Jill', 'John', 123456789, 0
)

select user_email, user_first_name, user_last_name, time, is_deleted from (
    select *, 
    row_number() over (partition by user_email order by time desc) as row_number
    from minimal_reproducible) inner_table 
where row_number = 1;
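
If the arbitrary choice among tied rows is a concern, adding a secondary sort key makes the result repeatable. The sketch below is one possible way to do that, not part of the original approach: it assumes SQLite as a stand-in for BigQuery, and the tie-breaking rule (prefer the row with the larger is_deleted) is a hypothetical choice made only for illustration, as are the names users and rn.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_email TEXT, user_first_name TEXT, "
    "user_last_name TEXT, time INTEGER, is_deleted INTEGER)"
)
# Tied data: both test@test.com rows have the same time.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    [
        ("test@test.com", "Joe", "John", 123456789, 1),
        ("test@test.com", "Joe", "John", 123456789, 0),
        ("test2@test.com", "Jill", "John", 123456789, 0),
    ],
)

# ROW_NUMBER() with a tie-breaker: newest first, then larger is_deleted first
# (a hypothetical rule; pick whatever column makes sense for your data).
deduped_rows = conn.execute(
    """
    SELECT user_email, user_first_name, user_last_name, time, is_deleted
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_email
            ORDER BY time DESC, is_deleted DESC
        ) AS rn
        FROM users
    )
    WHERE rn = 1
    ORDER BY user_email
    """
).fetchall()

for row in deduped_rows:
    print(row)
```

With the extra key, the same tied row is kept on every run, as long as the tied rows differ in the tie-breaking column.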

I hope this helps anyone using this approach to deduplicate their tables.