I'm having trouble getting running totals to work in BigQuery.
I found an example that works here: BigQuery SQL running totals
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30
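But what I really want to do is count how many of the most popular words it takes to cover 80% of the total word_count. So I tried computing the running total ordered by word_count instead: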
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30
But this is what I get:
Row word word_count f0_
1 o'er 18 18
2 answer 13 31
3 meet 8 39
4 told 5 44
5 treason 4 **52**
6 quality 4 **52**
7 brave 3 55
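The running total does not increase from row 5 to row 6, probably because word_count is 4 in both rows.
What am I doing wrong?
Is there perhaps a better way? My plan is to compute the running total, divide it by SUM(word_count) OVER(), keep only the rows below 80%, and then count those rows.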
Answer 0 (score: 3)
First, remove the "LIMIT 30"; it will interfere with the OVER() clause.
Do you want a ratio? Try RATIO_TO_REPORT:
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
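To make the ratio concrete outside BigQuery, here is a minimal Python sketch of what a ratio-to-report computes: each value divided by the sum of all values. The function name ratio_to_report and the list counts are made up for illustration; the numbers are the word counts from the sample output in the question.

```python
def ratio_to_report(values):
    """Return each value divided by the sum of all values."""
    total = sum(values)
    return [v / total for v in values]


# Word counts copied from the sample output in the question.
counts = [18, 13, 8, 5, 4, 4, 3]
ratios = ratio_to_report(counts)

for count, ratio in zip(counts, ratios):
    print(count, round(ratio, 4))
```

The ratios always add up to 1, which is why summing them later gives a cumulative share directly.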
Do you want consecutive rows with the same value to still increase? Add a secondary sort key to fix the order of those rows:
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
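The output in the question is consistent with a window that treats rows tied on the ORDER BY key as peers and adds them all at once, which is how the default RANGE frame behaves in standard SQL. The following Python sketch imitates that behavior as an illustration only; running_total_with_peers is a hypothetical helper, not a BigQuery function, and the rows are the seven from the question.

```python
from itertools import groupby


def running_total_with_peers(rows, key):
    """Running sum in which rows tied on `key` are added together and share one total."""
    ordered = sorted(rows, key=key)
    result = []
    total = 0
    for _, group in groupby(ordered, key=key):
        peers = list(group)
        total += sum(count for _, count in peers)
        result.extend((word, count, total) for word, count in peers)
    return result


rows = [("o'er", 18), ("answer", 13), ("meet", 8), ("told", 5),
        ("treason", 4), ("quality", 4), ("brave", 3)]

# ORDER BY word_count DESC: the two rows with count 4 are peers.
tied = running_total_with_peers(rows, key=lambda r: -r[1])

# ORDER BY word_count DESC, word: every row has a distinct sort key.
unique = running_total_with_peers(rows, key=lambda r: (-r[1], r[0]))

print([t for _, _, t in tied])    # [18, 31, 39, 44, 52, 52, 55]
print([t for _, _, t in unique])  # [18, 31, 39, 44, 48, 52, 55]
```

With the tie-breaker, no two rows share a sort key, so the running total advances on every row.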
Do you want the most popular words that cover 80%? Take those ratios, sum them up, and filter out the rest:
SELECT word, word_count, sum_ratio
FROM (
SELECT word, word_count, SUM(ratio) OVER(ORDER BY ratio, word) sum_ratio
FROM (
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word) ratio
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
)
)
WHERE sum_ratio>0.8
Row word word_count sum_ratio
1 is 313 0.8125175752219499
2 it 361 0.827019644076648
3 in 400 0.8430884184308841
4 my 441 0.8608042421564295
5 you 499 0.8808500381633391
6 of 630 0.906158357771261
7 to 635 0.9316675370586108
8 and 706 0.9600289237938375
9 the 995 0.9999999999999999
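Note that the query above accumulates the ratios from the least frequent word upward, so the rows with sum_ratio > 0.8 sit at the top end of that running sum. For comparison, the plan described in the question (running total in descending order, divided by the grand total, keeping rows below 80%) can be sketched in Python as follows. This is only an illustration under assumptions: words_covering and its share parameter are hypothetical names, and the sample rows are the seven from the question rather than the full hamlet corpus.

```python
from itertools import accumulate


def words_covering(rows, share=0.8):
    """Return the most frequent words whose cumulative share of the total is below `share`."""
    # Sort by count descending, then by word, so ties have a fixed order.
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [word for (word, _), cum in zip(ordered, running) if cum / total < share]


rows = [("o'er", 18), ("answer", 13), ("meet", 8), ("told", 5),
        ("treason", 4), ("quality", 4), ("brave", 3)]

top = words_covering(rows, share=0.75)
print(top, len(top))
```

Whether the row that crosses the threshold should be counted is a design choice; this sketch follows the question and keeps only rows strictly below it.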