BigQuery: filter the last 5 rows by date for each value of a category column

Time: 2019-10-30 00:24:31

Tags: google-bigquery

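Sorry, the title is a bit wordy; I'll create an example below to highlight what I mean. I have the following table of information:

date          team    num_val
2017-10-04    ab            7
2017-10-03    ab            6
2017-10-02    ab            8
2017-10-05    ab            3
2017-10-07    ab           12
2017-10-06    ab            3
2017-10-01    ab            5
2017-09-08    cd            4
2017-09-09    cd            8
2017-09-10    cd            2
2017-09-14    cd            1
2017-09-13    cd            5
2017-09-11    cd            6
2017-09-12    cd           13

With this table, I'd simply like to:

  • filter the 5 most recent dates for each team
  • group by team and sum the num_val column

Simple enough. However, there is no rhyme or reason to the dates for each team (I cannot simply filter on the 5 most recent dates overall, since they may differ for each team). I currently have the following query skeleton:

...

Any help with this is greatly appreciated, thanks!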

2 answers:

Answer 0: (score: 1)

Get the latest 5 values for each team:

SELECT team, ARRAY_AGG(num_val ORDER BY date DESC LIMIT 5) arr
FROM x
GROUP BY team

Then UNNEST(arr) and add up those num_vals.

SELECT team, (SELECT SUM(num_val) FROM UNNEST(arr) num_val) the_sum
FROM (previous)
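
The two steps above can be combined into one statement by inlining the first query where `(previous)` appears. The following is a minimal sketch, keeping the answer's placeholder table name `x` and assuming the `date`, `team`, and `num_val` columns from the question:

```sql
-- Sketch only: `x` is the placeholder table name used in the answer above.
SELECT
  team,
  -- Sum the (at most five) values collected for this team.
  (SELECT SUM(num_val) FROM UNNEST(arr) AS num_val) AS the_sum
FROM (
  -- Collect each team's five most recent num_val values into an array.
  SELECT team, ARRAY_AGG(num_val ORDER BY date DESC LIMIT 5) AS arr
  FROM x
  GROUP BY team
)
```

A team with fewer than five rows simply contributes all of its rows, since `LIMIT 5` is an upper bound on the array length.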

Answer 1: (score: 1)

Below are a few more options for BigQuery Standard SQL, so you can see different approaches

  

Option 1

#standardSQL
SELECT team, SUM(num_val) sum_num FROM (
  SELECT team, num_val, ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
  FROM `project.dataset.table`
)
WHERE pos <= 5
GROUP BY team
  

Option 2

#standardSQL
SELECT team, sum_num FROM (
  SELECT team, 
    SUM(num_val) OVER(PARTITION BY team ORDER BY DATE DESC ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS sum_num, 
    ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
  FROM `project.dataset.table`
)
WHERE pos = 1  

If applied to the sample data from your question, both options produce the following result

Row team    sum_num  
1   ab      31   
2   cd      27     
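
To reproduce this result without creating a table, the question's sample rows can be inlined in a CTE. This is a sketch of Option 1 under that assumption; the CTE name `sample_data` is hypothetical and stands in for `project.dataset.table`:

```sql
-- Sample rows copied from the question; `sample_data` is a hypothetical CTE name.
WITH sample_data AS (
  SELECT DATE '2017-10-04' AS date, 'ab' AS team, 7 AS num_val UNION ALL
  SELECT DATE '2017-10-03', 'ab', 6 UNION ALL
  SELECT DATE '2017-10-02', 'ab', 8 UNION ALL
  SELECT DATE '2017-10-05', 'ab', 3 UNION ALL
  SELECT DATE '2017-10-07', 'ab', 12 UNION ALL
  SELECT DATE '2017-10-06', 'ab', 3 UNION ALL
  SELECT DATE '2017-10-01', 'ab', 5 UNION ALL
  SELECT DATE '2017-09-08', 'cd', 4 UNION ALL
  SELECT DATE '2017-09-09', 'cd', 8 UNION ALL
  SELECT DATE '2017-09-10', 'cd', 2 UNION ALL
  SELECT DATE '2017-09-14', 'cd', 1 UNION ALL
  SELECT DATE '2017-09-13', 'cd', 5 UNION ALL
  SELECT DATE '2017-09-11', 'cd', 6 UNION ALL
  SELECT DATE '2017-09-12', 'cd', 13
)
SELECT team, SUM(num_val) AS sum_num
FROM (
  -- Number each team's rows from newest (pos = 1) to oldest.
  SELECT team, num_val,
    ROW_NUMBER() OVER (PARTITION BY team ORDER BY date DESC) AS pos
  FROM sample_data
)
WHERE pos <= 5  -- keep only the five most recent rows per team
GROUP BY team
ORDER BY team
```

For team ab the five most recent rows are 2017-10-03 through 2017-10-07 (6 + 7 + 3 + 3 + 12 = 31), and for team cd they are 2017-09-10 through 2017-09-14 (2 + 6 + 13 + 5 + 1 = 27), which matches the result shown above.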

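While the above options can be useful in some more complex cases, in your particular case I would go with the approach from Felipe's answer (Answer 0 above), something like the following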

#standardSQL
SELECT team, (SELECT SUM(num_val) FROM UNNEST(num_values)) sum_num
FROM (
  SELECT team, ARRAY_AGG(STRUCT(num_val) ORDER BY DATE DESC LIMIT 5) num_values
  FROM `project.dataset.table`
  GROUP BY team
)