Question

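I want to find the total word count for a given word in the publicly accessible BigQuery Shakespeare dataset (under samples -> shakespeare, also known as bigquery-public-data.samples.shakespeare).

The schema is as follows: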

Field name | Type     | Mode      |  Description
---------------------------------------------------
word       | STRING   | REQUIRED  | A single unique word (where whitespace is the delimiter) extracted from a corpus.
word_count | INTEGER  | REQUIRED  | The number of times this word appears in this corpus.
corpus     | STRING   | REQUIRED  | The work from which this word was extracted.
corpus_date| INTEGER  | REQUIRED  | The year in which this corpus was published.

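I have managed to group the lowercased words across all corpora, combine every corpus a word appears in into a new column found_in, and SUM the word counts into a total_word_count column.

My query is as follows: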

SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000

The output columns are:

Row     found_in    f0_     total_word_count

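My problem is renaming the f0_ column. This matters because I want to wrap the whole thing in another query, so that I can do something like SELECT * FROM {{that previous query}} WHERE word="thou".

What I don't understand is:

How to reference "word" in my WHERE clause.
How to name the LOWER(word) part of the main query with AS, as I did for STRING_AGG and SUM.

I tried the following: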

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'thou'

However, I get an error on the last line: Unrecognized name: word.

So I tried using AS:

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word) AS lowered_word,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'and'

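But then I got this error on the LOWER(word) line: SELECT list expression references column word which is neither grouped nor aggregated.

This confuses me, because I can see that word is referenced in the GROUP BY.

How do I correctly alias LOWER(word) so that I can reference it in the wrapping query?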

Answer 1

I think this is what you want:

SELECT
  *
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word) AS lowered_word,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  lowered_word
ORDER BY
  total_word_count DESC
LIMIT
  1000
) w
WHERE lowered_word = 'and';

Notes:

The subquery does not produce anything called word, so use lowered_word for the outer comparison.
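You can aggregate (GROUP BY) by the column alias in BigQuery.
The LIMIT in the subquery seems arbitrary. I don't think it improves performance or reduces the cost of the query.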
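
To make the first note concrete, here is a small sketch you can run locally. It is not BigQuery: it assumes SQLite (through Python's sqlite3 module) as a stand-in, a hypothetical table named shakespeare, and a few made-up rows, and it leaves out STRING_AGG to keep things short. The scoping rule it illustrates is the same one behind the Unrecognized name: word error: the outer query only sees the column names the subquery outputs.

```python
import sqlite3

# In-memory stand-in for the shakespeare sample table.
# Assumptions: SQLite instead of BigQuery, and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shakespeare "
    "(word TEXT, word_count INTEGER, corpus TEXT, corpus_date INTEGER)"
)
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?, ?, ?)",
    [
        ("Thou", 3, "hamlet", 1600),
        ("thou", 5, "hamlet", 1600),
        ("thou", 2, "macbeth", 1606),
        ("and", 7, "macbeth", 1606),
    ],
)

# The subquery exposes lowered_word and total_word_count, nothing else.
query = """
SELECT *
FROM (
  SELECT
    LOWER(word) AS lowered_word,
    SUM(word_count) AS total_word_count
  FROM shakespeare
  GROUP BY lowered_word
) AS w
WHERE lowered_word = ?
"""
rows = conn.execute(query, ("thou",)).fetchall()
print(rows)  # [('thou', 10)]

# Referencing the inner column name from the outer query fails,
# because word is not among the subquery's output columns.
try:
    conn.execute(
        "SELECT * FROM (SELECT LOWER(word) AS lowered_word FROM shakespeare) AS w "
        "WHERE word = 'thou'"
    )
    outer_sees_word = True
except sqlite3.OperationalError:
    outer_sees_word = False
print(outer_sees_word)  # False
```

SQLite and BigQuery differ in plenty of other ways, so treat this only as an illustration of which names are visible outside the subquery.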

How do I correctly "AS" a column that has the "LOWER" function applied, using BigQuery SQL syntax, so that I can reference it in a wrapping "SELECT"?
