BQ: Correlated subqueries that reference other tables are not supported - ARRAY_AGG not feasible

Time: 2018-10-03 04:32:06

Tags: google-bigquery

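We make heavy use of ARRAYs and STRUCTs in BQ, up to the point where changing the content of an array is no longer feasible because of the error in the title. See the simple example below, which uses public data. Assume an INNER JOIN cannot be used because some images are missing, whether on purpose or by mistake.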

Now, I know that you can usually move the LEFT JOIN out of the array redefinition into the FROM clause and use ARRAY_AGG, but that is not always possible.
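For context, the re-aggregation approach mentioned above might look roughly like the following. This is only a sketch on hypothetical inline data: the `vision` and `images` CTEs and the `id` column are stand-ins invented for illustration, not the real schemas of the public tables.

```sql
-- Hypothetical inline stand-ins for the two public tables (not their real schemas).
WITH vision AS (
  SELECT
    1 AS id,
    STRUCT(
      [STRUCT(0.9 AS score, 'http://a' AS url),
       STRUCT(0.5 AS score, 'http://b' AS url)] AS fullMatchingImages
    ) AS webDetection
),
images AS (
  SELECT 100 AS object_id, 'http://a' AS original_image_url
)
SELECT
  v.id,
  -- Rebuild the array after flattening it; unmatched URLs keep a NULL object_id.
  ARRAY_AGG(
    STRUCT(fmi.score AS score, fmi.url AS url, i.object_id AS object_id)
  ) AS fullMatchingImages_from_met
FROM vision AS v
-- Flatten the nested array into one row per element.
LEFT JOIN UNNEST(v.webDetection.fullMatchingImages) AS fmi
-- The join now lives in the FROM clause, so no correlated subquery is needed.
LEFT JOIN images AS i
  ON fmi.url = i.original_image_url
GROUP BY v.id
```

The catch is the GROUP BY: every other column has to be grouped or re-aggregated as well, which works for a scalar key like the assumed `id` but not for columns that are themselves arrays or structs.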

In our case, the "other fields" outside the array to be updated are themselves arrays or structs, as in the github-nested table, for example.

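Since you cannot do a SELECT DISTINCT on ARRAY or STRUCT fields, you end up having to UNNEST everything and re-create the table from scratch with many ARRAY_AGG calls, at a high resource cost and with a risk of OOM. For tables with many nested fields, this is not feasible.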

SELECT
  * EXCEPT(webDetection),
  STRUCT(
    webDetection.partialMatchingImages,
    webDetection.pagesWithMatchingImages,
    webDetection.fullMatchingImages,
    ARRAY(
      SELECT AS STRUCT
        fmi.score,
        fmi.url,
        i.object_id
      FROM
        data.webDetection.fullMatchingImages fmi
      LEFT JOIN
        `bigquery-public-data.the_met.images` i
      ON
        fmi.url = i.original_image_url
      ) AS fullMatchingImages_from_met,
    webDetection.webEntities
  ) AS webDetection
FROM
  `bigquery-public-data.the_met.vision_api_data` data

Any ideas on how to avoid re-aggregating?

2 Answers:

Answer 0: (score: 2)

Assuming your query is conceptually correct and the only problem is the error "Correlated subqueries that reference other tables are not supported", try replacing the fragment below

FROM
  data.webDetection.fullMatchingImages fmi
LEFT JOIN
  `bigquery-public-data.the_met.images` i
ON
  fmi.url = i.original_image_url

with

FROM
  data.webDetection.fullMatchingImages fmi
CROSS JOIN
  `bigquery-public-data.the_met.images` i
WHERE
  fmi.url = i.original_image_url   
  

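Update: to also keep the URLs that have no match (the CROSS JOIN ... WHERE version above behaves like an inner join and drops them)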

SELECT * EXCEPT(webDetection),
  STRUCT(
    webDetection.partialMatchingImages,
    webDetection.pagesWithMatchingImages,
    webDetection.fullMatchingImages,
    ARRAY(
      SELECT AS STRUCT *      
      FROM t.webDetection.fullMatchingImages_from_met_temp
      UNION ALL
      SELECT AS STRUCT *, NULL
      FROM t.webDetection.fullMatchingImages
      WHERE NOT url IN (SELECT url FROM t.webDetection.fullMatchingImages_from_met_temp)
    ) AS fullMatchingImages_from_met,
    webDetection.webEntities
    ) AS webDetection
FROM (
  SELECT * EXCEPT(webDetection),
    STRUCT(
      webDetection.partialMatchingImages,
      webDetection.pagesWithMatchingImages,
      webDetection.fullMatchingImages,
      ARRAY(
        SELECT AS STRUCT
          fmi.score,
          fmi.url,
          i.object_id
        FROM data.webDetection.fullMatchingImages fmi
        JOIN `bigquery-public-data.the_met.images` i
        ON fmi.url = i.original_image_url
      ) AS fullMatchingImages_from_met_temp,
      webDetection.webEntities
    ) AS webDetection
  FROM `bigquery-public-data.the_met.vision_api_data` data
) t 

Answer 1: (score: 0)

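To expand on the answer above: the query may still fail (for me it was a different query) because the optimizer still considers the subquery too complex.

In that case, try avoiding UNION ALL and use ARRAY_CONCAT() instead: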

SELECT * EXCEPT(webDetection),

  STRUCT(
    webDetection.partialMatchingImages,
    webDetection.pagesWithMatchingImages,
    webDetection.fullMatchingImages,

    ARRAY_CONCAT(
     ARRAY(
      SELECT AS STRUCT *      
      FROM t.webDetection.fullMatchingImages_from_met_temp
     ),
     ARRAY(
      SELECT AS STRUCT *, NULL
      FROM t.webDetection.fullMatchingImages
      WHERE NOT url IN (SELECT url FROM t.webDetection.fullMatchingImages_from_met_temp)
     ) 
    ) AS fullMatchingImages_from_met, 

    webDetection.webEntities
    ) AS webDetection
FROM (
  SELECT * EXCEPT(webDetection),
    STRUCT(
      webDetection.partialMatchingImages,
      webDetection.pagesWithMatchingImages,
      webDetection.fullMatchingImages,
      ARRAY(
        SELECT AS STRUCT
          fmi.score,
          fmi.url,
          i.object_id
        FROM data.webDetection.fullMatchingImages fmi
        JOIN `bigquery-public-data.the_met.images` i
        ON fmi.url = i.original_image_url
      ) AS fullMatchingImages_from_met_temp,
      webDetection.webEntities
    ) AS webDetection
  FROM `bigquery-public-data.the_met.vision_api_data` data
) t 

BQ accepted this and, interestingly, it was even faster than the UNION ALL version! The run time was almost the same as with ARRAY(... INNER JOIN ...).
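The ARRAY_CONCAT merge step from the query above can also be tried in isolation, without scanning the public tables. The following is a minimal sketch on hypothetical inline data: the `matched` column is an invented stand-in for `fullMatchingImages_from_met_temp`, and the NULL is given an explicit type and name, which the query above does not do.

```sql
-- Hypothetical inline row: the original array plus the already-joined matches.
WITH t AS (
  SELECT
    [STRUCT(0.9 AS score, 'http://a' AS url),
     STRUCT(0.5 AS score, 'http://b' AS url)] AS fullMatchingImages,
    [STRUCT(0.9 AS score, 'http://a' AS url, 100 AS object_id)] AS matched
)
SELECT
  ARRAY_CONCAT(
    -- Elements that found a match in the inner join.
    ARRAY(SELECT AS STRUCT * FROM t.matched),
    -- Elements without a match, padded with a typed NULL object_id.
    ARRAY(
      SELECT AS STRUCT *, CAST(NULL AS INT64) AS object_id
      FROM t.fullMatchingImages
      WHERE url NOT IN (SELECT url FROM t.matched)
    )
  ) AS fullMatchingImages_from_met
FROM t
```

With this data the result should contain both URLs, with a NULL object_id for the unmatched one, which is the same outcome a LEFT JOIN would have produced.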

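On the other hand, even though there are some verbose workarounds, which may not hold up for long, the BigQuery optimizer needs further tuning. Going by the original error message, "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN", a simple LEFT JOIN looks quite efficient to me...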

@readers, FYI: a bug has been filed here. Be sure to "star" it to raise its priority!