Question

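I'm trying to use an analytic function to get the top 2 countries by number of patent applications and, within those top 2 countries, the top 2 application kinds. For example, the answer would look like this: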

country  -   code 
US           P
US           A
GB           X
GB           P

This is the query I'm using:

SELECT
  country_code,
  MIN(count_country_code) count_country_code,
  application_kind
FROM (
  WITH
    A AS (
    SELECT
      country_code,
      COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
      application_kind
    FROM
      `patents-public-data.patents.publications`),
    B AS (
    SELECT
      country_code,
      count_country_code,
      DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
      application_kind,
      DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
    FROM
      A)
  SELECT
    country_code,
    count_country_code,
    application_kind
  FROM
    B
  WHERE
    country_code_num <= 2
    AND application_kind_num <= 2) x
GROUP BY
  country_code,
  application_kind
ORDER BY
  count_country_code DESC

Unfortunately, I get a "memory exceeded" error because of the OVER / ORDER BY / PARTITION BY clauses. Here is the message:

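Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%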

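How can I run the above query (or other similar queries) without hitting these memory errors? It can be tested on the public dataset `patents-public-data.patents.publications`.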

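One crude approach (which only works if the cardinality of the fields is fairly low) is to run it as a simple aggregation and then sort the results in memory outside the database.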
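A minimal sketch of that crude approach in Python, assuming the database has already returned one row per (country_code, application_kind) pair from a plain GROUP BY. The `rows` values and the helper name `top_kinds_for_top_countries` are made up for illustration. Note that it takes a plain top-N slice, so ties are handled differently from DENSE_RANK.

```python
from collections import Counter, defaultdict

# Hypothetical aggregated rows: (country_code, application_kind, count).
# In practice these would come from a simple
# GROUP BY country_code, application_kind query.
rows = [
    ("US", "P", 50), ("US", "A", 30), ("US", "X", 5),
    ("GB", "X", 20), ("GB", "P", 10), ("GB", "A", 1),
    ("DE", "A", 8), ("DE", "P", 2),
]


def top_kinds_for_top_countries(rows, n_countries=2, n_kinds=2):
    """Pick the top countries by total count, then the top kinds in each."""
    country_totals = Counter()
    kinds_by_country = defaultdict(list)
    for country, kind, n in rows:
        country_totals[country] += n
        kinds_by_country[country].append((kind, n))

    result = []
    # most_common() returns countries ordered by total count, descending.
    for country, _total in country_totals.most_common(n_countries):
        top = sorted(kinds_by_country[country],
                     key=lambda kind_count: kind_count[1],
                     reverse=True)[:n_kinds]
        result.extend((country, kind, n) for kind, n in top)
    return result


result = top_kinds_for_top_countries(rows)
for country, kind, n in result:
    print(country, kind, n)
```

This only stays cheap while the number of distinct (country, kind) pairs is small enough to hold in client memory.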

Answer 1

The following is for BigQuery Standard SQL:

#standardSQL
WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2
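The query above aggregates first, so the only analytic function left (DENSE_RANK in C) sorts one row per (country_code, application_kind) pair rather than every publication row. To try the same pattern without BigQuery access, here is a small sketch using Python's built-in sqlite3 module. The local table name `publications` and its rows are made up, an SQLite build with window-function support (3.25 or later) is assumed, and the final ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publications (country_code TEXT, application_kind TEXT)"
)

# Made-up sample: one row per publication.
sample = (
    [("US", "P")] * 5 + [("US", "A")] * 3 + [("US", "X")] * 1
    + [("GB", "X")] * 4 + [("GB", "P")] * 2 + [("GB", "A")] * 1
    + [("DE", "A")] * 2
)
conn.executemany("INSERT INTO publications VALUES (?, ?)", sample)

query = """
WITH A AS (              -- top 2 countries by number of rows
  SELECT country_code
  FROM publications
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (                -- per-kind counts, only for those countries
  SELECT country_code, application_kind,
         COUNT(1) AS application_kind_count
  FROM publications
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (                -- rank the already aggregated rows
  SELECT country_code, application_kind, application_kind_count,
         DENSE_RANK() OVER (PARTITION BY country_code
                            ORDER BY application_kind_count DESC)
           AS application_kind_rank
  FROM B
)
SELECT country_code, application_kind, application_kind_count
FROM C
WHERE application_kind_rank <= 2
ORDER BY country_code, application_kind_count DESC
"""

top_rows = conn.execute(query).fetchall()
for row in top_rows:
    print(row)
```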
