I want to use the following query to find the last value of y over an ordered partition:
SELECT
x,
LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table
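But LAST_VALUE returns many values that are not the last value of y for the given partition (in this case, the largest value). Why?
(In this case MAX could be used instead of LAST_VALUE to find the largest value, but why doesn't LAST_VALUE return the largest value as well?)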
Answer 0 (score: 22)
TLDR: The query you want is:
SELECT
x,
LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table
possibly followed by GROUP BY to collapse the duplicate output rows produced by the analytic function.
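As a rough illustration of that collapsing step, here is a minimal sketch that runs the query through Python's sqlite3 module. SQLite is only a stand-in for BigQuery here (it needs SQLite 3.25 or newer for window functions), and the table name `t` and its rows are made-up sample data, not part of the original question.

```python
import sqlite3

# Hypothetical sample data: table `t` and its rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 3), ("a", 2), ("b", 10), ("b", 20)],
)

WINDOWED = """
    SELECT x,
           LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_y
    FROM t
"""

# The analytic query alone yields one output row per input row.
per_row = conn.execute(WINDOWED + " ORDER BY x").fetchall()

# Wrapping it in a subquery and grouping keeps one row per partition.
collapsed = conn.execute(
    f"SELECT x, last_y FROM ({WINDOWED}) GROUP BY x, last_y ORDER BY x"
).fetchall()

print(per_row)    # five rows, with the partition's last value repeated
print(collapsed)  # one row per value of x
```

Grouping on both columns is safe because every row in a partition carries the same `last_y`; `SELECT DISTINCT` over the subquery would collapse the rows in the same way.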
Of course, if MAX is all you need, it is simpler to just use:
SELECT
x,
MAX(y) OVER (PARTITION BY x)
FROM table
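Before answering the question, here is some background on analytic functions (a.k.a. window functions). Everything below is standard SQL and not specific to BigQuery.
First, analytic functions are not aggregate functions. An aggregate function collapses multiple input rows into a single output row, whereas an analytic function computes exactly one output row for every input row. So you need to think about what the output is for each input row.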
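Second, analytic functions operate over a "window" of rows, which is a subset of the "partition" that the row belongs to. The partition for an input row is determined by the PARTITION BY clause; you can omit it if you want the partition to be the entire set of input rows. The window is given by the ROWS clause, but if you don't specify it (and users usually don't), it defaults to either the entire partition (when no ordering is applied) or the set of rows in the partition from the first row to the current row (when an ORDER BY is present). Note that the window can be different for every input row in a partition!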
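To make the two default windows concrete, here is a small sketch that again uses Python's sqlite3 module as a stand-in engine (an assumption; it needs SQLite 3.25 or newer). The table `t` and its rows are hypothetical. SUM is used because it makes the size of the window visible: without ORDER BY every row sees the whole partition, while with ORDER BY each row sees only the rows up to itself.

```python
import sqlite3

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)],
)

rows = conn.execute(
    """
    SELECT x,
           y,
           -- No ORDER BY: the window defaults to the entire partition.
           SUM(y) OVER (PARTITION BY x) AS whole_partition,
           -- With ORDER BY: the window defaults to "first row .. current row".
           SUM(y) OVER (PARTITION BY x ORDER BY y) AS up_to_current
    FROM t
    ORDER BY x, y
    """
).fetchall()

for row in rows:
    print(row)
```

The `whole_partition` column is constant within each partition, while `up_to_current` grows row by row, which is the cumulative-sum behavior that the default window is designed for.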
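Now, back to LAST_VALUE. Although the default window described above is reasonable in many cases (e.g., computing cumulative sums), it works very poorly with LAST_VALUE. The LAST_VALUE function returns the value of the last row in the window, and by default the last row in the window is the current row.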
So, to fix this, you need to explicitly specify that the window for LAST_VALUE is the entire partition, not just the rows up to the current row. You can do so as follows:
SELECT x, LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table
To test this, here is an example (written in BigQuery legacy SQL, where a comma between subqueries in FROM means UNION ALL):
SELECT
x,
FIRST_VALUE(x) OVER (ORDER BY x ASC) first_asc,
FIRST_VALUE(x) OVER (ORDER BY x DESC) first_desc,
LAST_VALUE(x) OVER (ORDER BY x ASC) last_asc,
LAST_VALUE(x) OVER (ORDER BY x DESC) last_desc,
FROM
(SELECT 4 as x),
(SELECT 2 as x),
(SELECT 1 as x),
(SELECT 3 as x)
x,first_asc,first_desc,last_asc,last_desc
1,1,4,1,1
2,1,4,2,2
3,1,4,3,3
4,1,4,4,4
Note that LAST_VALUE returns 1,2,3,4 instead of just 4, because the window changes for each input row.
Now let's specify a window that is the entire partition:
SELECT
x,
FIRST_VALUE(x) OVER (ORDER BY x ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_asc,
FIRST_VALUE(x) OVER (ORDER BY x DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_desc,
LAST_VALUE(x) OVER (ORDER BY x ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_asc,
LAST_VALUE(x) OVER (ORDER BY x DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_desc,
FROM
(SELECT 4 as x),
(SELECT 2 as x),
(SELECT 1 as x),
(SELECT 3 as x)
x,first_asc,first_desc,last_asc,last_desc
1,1,4,4,1
2,1,4,4,1
3,1,4,4,1
4,1,4,4,1
Now we get 4 for LAST_VALUE, as expected.
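For readers without BigQuery at hand, the same experiment can be sketched with Python's sqlite3 module. This is an adaptation rather than the answer's original code: SQLite (3.25 or newer) is an assumed stand-in, the legacy comma-separated subqueries are rewritten as UNION ALL, and the helper `run` is a hypothetical name introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Explicit window covering the entire partition.
FULL = "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
# The four input rows, combined with UNION ALL instead of legacy-SQL commas.
SOURCE = "(SELECT 4 AS x UNION ALL SELECT 2 UNION ALL SELECT 1 UNION ALL SELECT 3)"


def run(frame):
    """Run the test query with the given window frame clause ('' = default window)."""
    sql = f"""
        SELECT x,
               FIRST_VALUE(x) OVER (ORDER BY x ASC {frame}) AS first_asc,
               FIRST_VALUE(x) OVER (ORDER BY x DESC {frame}) AS first_desc,
               LAST_VALUE(x) OVER (ORDER BY x ASC {frame}) AS last_asc,
               LAST_VALUE(x) OVER (ORDER BY x DESC {frame}) AS last_desc
        FROM {SOURCE}
        ORDER BY x
    """
    return conn.execute(sql).fetchall()


default_window = run("")    # window ends at the current row
full_window = run(FULL)     # window spans the whole partition

for row in default_window:
    print(row)
for row in full_window:
    print(row)
```

With the default window, both LAST_VALUE columns simply echo the current row's x; with the explicit frame they become constant across all rows, matching the two result tables above.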
Answer 1 (score: 2)
Even though the question title uses LAST_VALUE, the question itself asks for the largest value!
I would go with the following:
SELECT x, MAX(y) OVER (PARTITION BY x) FROM table
If no other fields of the table are involved, I would just do a simple GROUP BY:
SELECT x, MAX(y) FROM table GROUP BY x
Of course, we should keep in mind that the largest value and the MAX value are not always the same thing.
Answer 2 (score: 0)
Another option you have is to change the ordering of the query to desc:
SELECT
x,
LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table
order by x desc
But then you only get the last value on the first row.
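A related variant, which is not what this answer wrote but follows the same idea of reversing the order, is to reverse the ORDER BY inside the window and use FIRST_VALUE. The first row of the window is the same for every input row even with the default window, so no ROWS clause is needed. The sketch below checks this against MAX using Python's sqlite3 module (an assumed stand-in for BigQuery, SQLite 3.25 or newer) and a hypothetical table `t`.

```python
import sqlite3

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT, y INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 3), ("a", 2), ("b", 10), ("b", 20)],
)

rows = conn.execute(
    """
    SELECT x,
           y,
           -- Descending order inside the window: the first row holds the largest y.
           FIRST_VALUE(y) OVER (PARTITION BY x ORDER BY y DESC) AS first_desc,
           MAX(y) OVER (PARTITION BY x) AS max_y
    FROM t
    ORDER BY x, y
    """
).fetchall()

for row in rows:
    print(row)
```

In this sample, `first_desc` and `max_y` agree on every row, which is consistent with the `first_desc` column in Answer 0's first result table.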