Why does LAST_VALUE not return the last value?

Time: 2016-01-30 00:18:45

Tags: google-bigquery

I want to find the last value of y over an ordered partition, using a query like this:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table

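But LAST_VALUE returns many values that are not the last value of y (in this case, the largest value) for the given partition. Why?

(In this case MAX could be used instead of LAST_VALUE to find the largest value, but why doesn't LAST_VALUE return the largest value as well?)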

3 answers:

Answer 0: (score: 22)

TL;DR: The query you want is:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

Possibly followed by a GROUP BY to collapse the duplicate output rows produced by the analytic function.
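
As a minimal sketch of that collapsing step, the snippet below runs the query through Python's sqlite3 module instead of BigQuery (assuming the bundled SQLite is 3.25 or newer, the release that added window functions). The table name t and the sample rows are made up for illustration, and wrapping the windowed query in a subquery before the GROUP BY is only one possible way to do it.

```python
import sqlite3

# Hypothetical sample data; "t" stands in for the question's "table".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 30), (1, 20), (2, 5), (2, 7)],
)

# The analytic function emits one output row per input row.
windowed = """
    SELECT x,
           LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_y
    FROM t
"""
per_row = conn.execute(windowed + " ORDER BY x").fetchall()

# Grouping on both output columns collapses the duplicates.
collapsed = conn.execute(
    "SELECT x, last_y FROM (" + windowed + ") GROUP BY x, last_y ORDER BY x"
).fetchall()

print(per_row)
print(collapsed)
```

The first result has one row per input row, all carrying the same value within a partition; the second has one row per partition.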

Of course, if MAX is all you need, it is simpler to just use it:

SELECT x, MAX(y) OVER (PARTITION BY x) FROM table

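Before answering the question, here is some background on analytic functions (a.k.a. window functions). Everything below is standard SQL and is not specific to BigQuery.

First, analytic functions are not aggregate functions. An aggregate function collapses multiple input rows into a single output row, whereas an analytic function computes exactly one output row for each input row. So you need to think about the output for every input row.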

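Second, an analytic function operates over a "window" of rows, which is a subset of the "partition" that the row belongs to. The partition for an input row is determined by the PARTITION BY clause, which you can omit if you want the partition to be the entire set of input rows. The window is given by the ROWS clause, but if you don't specify it (and users usually don't), it defaults to either the entire partition (when no ordering is applied) or the set of rows in the partition from the first row to the current row (when an ORDER BY is present). Note that the window can be different for each input row in a partition!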
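
To make the two default windows concrete, here is a small sketch that counts the rows in each row's window. It uses Python's sqlite3 module as a stand-in for BigQuery (assuming SQLite 3.25 or newer); the table t and its rows are hypothetical, and the y values are deliberately distinct so that ties in the ordering do not complicate the picture.

```python
import sqlite3

# Hypothetical single-partition sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (1, 20), (1, 30)])

rows = conn.execute("""
    SELECT y,
           -- No ORDER BY: the window is the whole partition.
           COUNT(*) OVER (PARTITION BY x)            AS n_unordered,
           -- ORDER BY present: the window runs from the first row to the current row.
           COUNT(*) OVER (PARTITION BY x ORDER BY y) AS n_ordered
    FROM t
    ORDER BY y
""").fetchall()

for row in rows:
    print(row)
```

The unordered count is 3 on every row, while the ordered count grows as 1, 2, 3, which shows that each input row sees a different window.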

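Now, back to LAST_VALUE. While the default window described above is reasonable in many cases (e.g., computing cumulative sums), it works really badly for LAST_VALUE. LAST_VALUE returns the value of the last row in the window, and by default the last row in the window is the current row.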

So, to fix the problem, you need to explicitly specify that the window for LAST_VALUE is the entire partition, not just the rows up to the current row. You can do so as follows:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

To test this out, here is an example:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,1,1
2,1,4,2,2
3,1,4,3,3
4,1,4,4,4

Note that LAST_VALUE returns 1, 2, 3, 4 instead of just 4, because the window changes for each input row.

Now let's specify a window that is the entire partition:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,4,1
2,1,4,4,1
3,1,4,4,1
4,1,4,4,1

Now we get 4 for LAST_VALUE (the last_asc column), as expected.
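
The first_desc column in the example output hints at an alternative that needs no explicit window: with a descending sort, FIRST_VALUE returns the largest value even under the default window, because the first row of the window is the same for every input row. The sketch below checks this against the question's partitioned shape, using Python's sqlite3 module in place of BigQuery (assuming SQLite 3.25 or newer). The table t, its rows, and the alias max_y are hypothetical.

```python
import sqlite3

# Hypothetical sample data with two partitions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 30), (1, 20), (2, 5), (2, 7)],
)

# Default window (first row to current row) with a descending sort:
# the first row of every window is the row with the largest y.
rows = conn.execute("""
    SELECT x, y,
           FIRST_VALUE(y) OVER (PARTITION BY x ORDER BY y DESC) AS max_y
    FROM t
    ORDER BY x, y
""").fetchall()

for row in rows:
    print(row)
```

Every row in partition 1 reports 30 and every row in partition 2 reports 7.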

Answer 1: (score: 2)

Even though the question title uses LAST_VALUE, the question itself asks for the largest value.
I would go with the following:

SELECT x, MAX(y) OVER (PARTITION BY x) FROM table  

If no other fields of the table are involved, I would just do a simple GROUP BY:

SELECT x, MAX(y) FROM table GROUP BY x 

Of course, we should remember that the largest value and the MAX value are not always the same thing.

Answer 2: (score: 0)

Another option you have is to change the order of the query to desc:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table
order by x desc

But you will only get the last value on the first row.
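
A rough sketch of that behaviour, run through Python's sqlite3 module rather than BigQuery (assuming SQLite 3.25 or newer). The table t and its rows are hypothetical. The extra y column in the output and the y DESC tiebreaker in the outer ORDER BY are assumptions added here so that the row order within each partition is deterministic; the query above sorts by x only.

```python
import sqlite3

# Hypothetical sample data with two partitions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 30), (1, 20), (2, 5), (2, 7)],
)

# The outer ORDER BY only reorders the output rows; the window of
# LAST_VALUE is still "first row to current row" within the partition.
rows = conn.execute("""
    SELECT x, y,
           LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC) AS last_y
    FROM t
    ORDER BY x DESC, y DESC
""").fetchall()

for row in rows:
    print(row)
```

In this sketch last_y simply equals y on every row, so only the first output row of each partition carries that partition's largest value.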