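I'm trying to use an analytic function to get the top 2 countries by number of patent applications and, within each of those top 2 countries, the top 2 application kinds. For example, the answer would look like this: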
country - code
US P
US A
GB X
GB P
Here is the query I'm using:
SELECT
country_code,
MIN(count_country_code) count_country_code,
application_kind
FROM (
WITH
A AS (
SELECT
country_code,
COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
application_kind
FROM
`patents-public-data.patents.publications`),
B AS (
SELECT
country_code,
count_country_code,
DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
application_kind,
DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
FROM
A)
SELECT
country_code,
count_country_code,
application_kind
FROM
B
WHERE
country_code_num <= 2
AND application_kind_num <= 2) x
GROUP BY
country_code,
application_kind
ORDER BY
count_country_code DESC
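Unfortunately, because of the OVER / ORDER BY / PARTITION BY clauses, I get a "memory exceeded" error. Here is the message:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%
How can I run the above query (or other similar queries) without hitting these memory errors? This can be tested against the public dataset `patents-public-data.patents.publications`.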
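One rough approach (which only works when the cardinality of the fields is fairly low) is to run this as a simple aggregation and keep the results in memory outside the database. For example: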
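A minimal sketch of that client-side idea, assuming the rows come from a plain `GROUP BY country_code, application_kind` aggregation and have already been fetched as `(country_code, application_kind, count)` tuples. The function name `top_kinds_for_top_countries` and the sample counts are hypothetical and used only for illustration; they are not real figures from the dataset.

```python
from collections import Counter, defaultdict


def top_kinds_for_top_countries(rows, n_countries=2, n_kinds=2):
    """Pick the top countries by total count, then the top kinds within each.

    rows: iterable of (country_code, application_kind, count) tuples,
    e.g. the result of a plain GROUP BY country_code, application_kind.
    """
    country_totals = Counter()
    kinds_by_country = defaultdict(Counter)
    for country, kind, count in rows:
        country_totals[country] += count
        kinds_by_country[country][kind] += count

    result = []
    # most_common() sorts by count descending; ties keep insertion order.
    for country, _total in country_totals.most_common(n_countries):
        for kind, count in kinds_by_country[country].most_common(n_kinds):
            result.append((country, kind, count))
    return result


# Made-up counts for illustration only.
sample_rows = [
    ("US", "P", 50), ("US", "A", 40), ("US", "B", 5),
    ("GB", "X", 30), ("GB", "P", 20), ("GB", "A", 1),
    ("FR", "A", 10),
]
print(top_kinds_for_top_countries(sample_rows))
```

Unlike `DENSE_RANK()`, `most_common(n)` returns exactly `n` entries and does not keep extra rows when counts are tied, so the two can differ on ties.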
Answer 0 (score: 2)
Below is for BigQuery Standard SQL:
#standardSQL
WITH A AS (
SELECT country_code
FROM `patents-public-data.patents.publications`
GROUP BY country_code
ORDER BY COUNT(1) DESC
LIMIT 2
), B AS (
SELECT
country_code,
application_kind,
COUNT(1) application_kind_count
FROM `patents-public-data.patents.publications`
WHERE country_code IN (SELECT country_code FROM A)
GROUP BY country_code, application_kind
), C AS (
SELECT
country_code,
application_kind,
application_kind_count,
DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
FROM B
)
SELECT
country_code,
application_kind,
application_kind_count
FROM C
WHERE application_kind_rank <= 2
This returns a result.