Getting the top patent countries and codes in a BQ public dataset

Time: 2019-05-23 22:52:55

Tags: sql google-bigquery

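I am trying to use an analytic function to get the top 2 countries by number of patent applications and, within those top 2 countries, the top 2 application kinds. For example, the answer would look like this: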

country   code
US        P
US        A
GB        X
GB        P

Here is the query I am using:

SELECT
  country_code,
  MIN(count_country_code) count_country_code,
  application_kind
FROM (
  WITH
    A AS (
    SELECT
      country_code,
      COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
      application_kind
    FROM
      `patents-public-data.patents.publications`),
    B AS (
    SELECT
      country_code,
      count_country_code,
      DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
      application_kind,
      DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
    FROM
      A)
  SELECT
    country_code,
    count_country_code,
    application_kind
  FROM
    B
  WHERE
    country_code_num <= 2
    AND application_kind_num <= 2) x
GROUP BY
  country_code,
  application_kind
ORDER BY
  count_country_code DESC

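Unfortunately, I get a "memory exceeded" error because of the OVER / ORDER BY / PARTITION BY. Here is the message:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%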

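How can I run the query above (or other similar queries) without hitting these memory errors? This can be tested on the public dataset `patents-public-data.patents.publications`.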

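A rough way to do this (it only works if the cardinality of the fields is fairly low) is to run it as a simple aggregation and hold the results in memory outside the database. For example:

[image]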
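The rough approach described above could look something like the following Python sketch. It assumes the database has already returned one row per (country_code, application_kind) pair with its count, for instance from a plain GROUP BY. The function name `top_kinds_for_top_countries` and the sample counts are made up for illustration and do not come from the original post.

```python
from collections import Counter, defaultdict


def top_kinds_for_top_countries(rows, n_countries=2, n_kinds=2):
    """Pick the top application kinds within the top countries.

    rows: iterable of (country_code, application_kind, count) tuples,
    i.e. the already-aggregated output of a simple GROUP BY.
    """
    country_totals = Counter()
    kinds_by_country = defaultdict(Counter)
    for country, kind, count in rows:
        country_totals[country] += count
        kinds_by_country[country][kind] += count

    result = []
    # Countries ordered by total count, largest first.
    for country, _ in country_totals.most_common(n_countries):
        # Kinds within that country, largest first.
        for kind, count in kinds_by_country[country].most_common(n_kinds):
            result.append((country, kind, count))
    return result


# Hypothetical aggregated counts, not real patent data.
sample_rows = [
    ("US", "P", 50), ("US", "A", 40), ("US", "X", 5),
    ("GB", "X", 30), ("GB", "P", 20), ("GB", "A", 1),
    ("FR", "A", 10),
]

for row in top_kinds_for_top_countries(sample_rows):
    print(row)
```

Unlike `DENSE_RANK()`, `Counter.most_common` does not keep every tied row: when counts tie, it returns items in first-seen order and cuts off at the requested number.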

1 Answer:

Answer 0: (score: 2)

The following is for BigQuery Standard SQL:

#standardSQL
-- A: the two countries with the most publications (plain aggregation, no window function)
WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
-- B: per-kind counts, restricted to those two countries
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
-- C: rank the kinds within each country; the window now runs over the small aggregated result
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
-- keep the top 2 kinds per country
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2

with the result:

[image: query result]
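The pattern in the answer's query (aggregate first, then apply `DENSE_RANK()` to the small aggregated result) can be tried on toy data without BigQuery. The sketch below is only an illustration under several assumptions: it uses Python's built-in `sqlite3` module with SQLite 3.25 or newer (needed for window functions), a hypothetical in-memory table named `publications`, and invented rows. An `ORDER BY` is added to the final SELECT only to make the toy output deterministic.

```python
import sqlite3

# Same shape as the answer's query, pointed at a toy table (assumption).
QUERY = """
WITH A AS (
  SELECT country_code
  FROM publications
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT country_code, application_kind, COUNT(1) AS application_kind_count
  FROM publications
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT country_code, application_kind, application_kind_count,
         DENSE_RANK() OVER (PARTITION BY country_code
                            ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT country_code, application_kind, application_kind_count
FROM C
WHERE application_kind_rank <= 2
ORDER BY country_code, application_kind_count DESC
"""


def run_demo():
    """Load invented rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publications (country_code TEXT, application_kind TEXT)"
    )
    # Invented data: US has 9 rows, GB has 7, FR has 2.
    toy_rows = (
        [("US", "P")] * 5 + [("US", "A")] * 3 + [("US", "X")] * 1
        + [("GB", "X")] * 4 + [("GB", "P")] * 2 + [("GB", "A")] * 1
        + [("FR", "A")] * 2
    )
    conn.executemany("INSERT INTO publications VALUES (?, ?)", toy_rows)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in run_demo():
    print(row)
```

In this toy run the window function only ever sees one row per (country, kind) pair, which is the property that keeps the sort inside `OVER()` small.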