How do I correctly alias (AS) a column with the LOWER function applied in BigQuery SQL so that it can be referenced in a wrapping SELECT?

Time: 2019-06-12 19:24:05

Tags: sql google-bigquery

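I'd like to find the total word count for a given word in the publicly accessible BigQuery dataset on Shakespeare (under samples -> shakespeare, also known as bigquery-public-data.samples.shakespeare).

The schema is as follows: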

Field name | Type     | Mode      |  Description
---------------------------------------------------
word       | STRING   | REQUIRED  | A single unique word (where whitespace is the delimiter) extracted from a corpus.
word_count | INTEGER  | REQUIRED  | The number of times this word appears in this corpus.
corpus     | STRING   | REQUIRED  | The work from which this word was extracted.
corpus_date| INTEGER  | REQUIRED  | The year in which this corpus was published.

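I have successfully used GROUP BY to group all lower-cased words across every corpus, STRING_AGG to combine the corpus values into a new column found_in, and SUM to add up their word counts into the total_word_count column.

My query is as follows: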

SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000

The output columns are

Row     found_in    f0_     total_word_count 

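My problem is renaming the f0_ column. This matters because I want to wrap the whole thing in another query so I can do something like SELECT * FROM {{that previous query}} WHERE word="thou".

What I don't understand is

  1. How to reference "word" in my WHERE clause.

  2. How to name the LOWER(word) part of the main query using AS, as I did for SUM and STRING_AGG.

I tried the following: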

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'thou'

However, I get an error on the last line: Unrecognized name: word

So I tried using AS

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word) AS lowered_word,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'and'

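But then I get the error SELECT list expression references column word which is neither grouped nor aggregated on the LOWER(word) line.

This confuses me, because I can see word referenced in the GROUP BY.

How do I correctly alias LOWER(word) so that I can reference it in the wrapping query?

1 Answer:

Answer 0: (score: 2)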

I think this is what you want:

SELECT *
FROM (SELECT STRING_AGG(DISTINCT corpus) AS found_in,
             LOWER(word) AS lowered_word,
             SUM(word_count) AS total_word_count
      FROM `bigquery-public-data.samples.shakespeare`
      GROUP BY lowered_word
      ORDER BY total_word_count DESC
      LIMIT 1000
     ) w
WHERE lowered_word = 'and';

Notes:

  • The subquery does not produce a column called word, so use lowered_word for the outer comparison.
  • You can aggregate by column aliases in BigQuery.
  • The LIMIT in the subquery seems arbitrary. I don't think it improves performance or reduces the cost of the query.
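
If only one word's total is needed, one alternative worth considering is to filter before aggregating, which removes the need for both the wrapper query and the LIMIT. This is a minimal sketch rather than part of the original answer; it assumes the goal is a single result row for the word 'thou' (the example value from the question). Because WHERE is evaluated before the SELECT list, the alias lowered_word is not visible there, so the filter repeats LOWER(word).

```sql
-- Sketch: filter on the lower-cased word first, then aggregate.
-- Assumes a single target word ('thou'); swap in any other literal.
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,   -- works containing the word
  LOWER(word) AS lowered_word,               -- alias replaces the default f0_ name
  SUM(word_count) AS total_word_count        -- total across all matching rows
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  LOWER(word) = 'thou'                       -- alias cannot be used in WHERE
GROUP BY
  lowered_word                               -- alias is allowed in GROUP BY
```

Note that a wrapper query over a subquery with ORDER BY total_word_count DESC LIMIT 1000 only sees the 1000 most frequent words, so a rarer word would return no rows there; filtering inside the query avoids that.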