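I want to find the total word count for a given word in the publicly accessible BigQuery dataset on Shakespeare (under samples -> shakespeare, also known as bigquery-public-data.samples.shakespeare).
The schema is as follows: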
Field name | Type | Mode | Description
---------------------------------------------------
word | STRING | REQUIRED | A single unique word (where whitespace is the delimiter) extracted from a corpus.
word_count | INTEGER | REQUIRED | The number of times this word appears in this corpus.
corpus | STRING | REQUIRED | The work from which this word was extracted.
corpus_date| INTEGER | REQUIRED | The year in which this corpus was published.
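I have managed to group the lowercased words across every corpus, combine all the corpus values for each word into a new column found_in, and SUM their word counts into a total_word_count column.
My query is as follows: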
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word),
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
The output columns are:
Row found_in f0_ total_word_count
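My problem is renaming the f0_ column. This matters because I want to wrap the whole thing in another query, so that I can do something like SELECT * FROM {{that previous query}} WHERE word="thou".
What I don't understand is:
How do I reference "word" in my WHERE clause?
How do I name the LOWER(word) part of the main query with AS, the way I did for SUM and STRING_AGG?
I tried the following: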
SELECT
*
FROM
(
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word),
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
)
WHERE word = 'thou'
However, I get an error on the last line: Unrecognized name: word.
So I tried using AS:
SELECT
*
FROM
(
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word) AS lowered_word,
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
)
WHERE word = 'and'
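But then I get the error "SELECT list expression references column word which is neither grouped nor aggregated" on the LOWER(word) line.
This confuses me, because I can see word referenced in the GROUP BY.
How do I correctly alias LOWER(word) so that I can reference it in the outer query?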
Answer 0 (score: 2)
I think this is what you want:
SELECT *
FROM (SELECT STRING_AGG(DISTINCT corpus) AS found_in,
             LOWER(word) AS lowered_word,
             SUM(word_count) AS total_word_count
      FROM `bigquery-public-data.samples.shakespeare`
      GROUP BY lowered_word
      ORDER BY total_word_count DESC
      LIMIT 1000
     ) w
WHERE lowered_word = 'and';
Notes:
The subquery's output has no column named word, so use lowered_word for the outer comparison.
The LIMIT 1000 seems arbitrary. I don't think it improves performance or reduces query cost.
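If the goal is only the total for one given word, the filter can also be applied before aggregating, which avoids the wrapper query altogether. The following is a minimal sketch, assuming BigQuery standard SQL and reusing the lowered_word alias from the query above; 'thou' is just the example word from the question.

```sql
-- Sketch: filter on the lowercased word first, then aggregate.
-- Assumes BigQuery standard SQL, which accepts a SELECT alias in GROUP BY.
SELECT
  LOWER(word) AS lowered_word,
  STRING_AGG(DISTINCT corpus) AS found_in,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  LOWER(word) = 'thou'  -- the alias is not visible in WHERE, so repeat the expression
GROUP BY
  lowered_word;
```

Because there is no LIMIT here, this returns a row for any word that occurs in the dataset, not only for words among the 1000 most frequent.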