Filtering on ranges in nested arrays in BigQuery, then de-duplicating the results

Time: 2019-06-27 00:27:39

Tags: google-bigquery

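I'm working with BigQuery and nested tables, and SQL is not my strong suit. I have a real problem to solve on actual production data, and at the same time I'm trying to get some SQL/BQ concepts into my head.

My query is similar to some of those on the Working with Arrays in Standard SQL page, but those don't go quite far enough for my case.

Let me give you some sample data, structured much like the real data, and then describe what I need.

Basically, I have two tables, and I want to filter one using the other.

Table 1 has two levels of nesting and can be built like this: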

WITH data AS (
    SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
    UNION ALL
    SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
    UNION ALL
    SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
)
SELECT * FROM data

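What the numbers mean is irrelevant. What matters is that table 2 contains the ranges I want to use to filter table 1. Table 2 can be built like this: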

WITH ranges AS (
    SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
    UNION ALL
    SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
)
SELECT * FROM ranges

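What I want to end up with is the rows from the first table where any of the results fall within one or more of the ranges in the second table, and none of the rows that have no match.

I know I can do some UNNEST()ing and JOINing on the two tables to get a filtered result, but because of the nesting that result contains duplicates: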

WITH data AS (
  SELECT "Test 1" as name, [STRUCT(1 as id, [20, 21] as results), STRUCT(2 as id, [22, 23] as results)] as resultset
  UNION ALL
  SELECT "Test 2" as name, [STRUCT(1 as id, [23, 24] as results), STRUCT(2 as id, [25, 26] as results)] as resultset
  UNION ALL
  SELECT "Test 3" as name, [STRUCT(1 as id, [26, 27] as results), STRUCT(2 as id, [28, 29] as results)] as resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 as min, 25.0 as max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 as min, 27.0 as max
)
SELECT data.*
FROM data, UNNEST(resultset), UNNEST(results) r
JOIN ranges
ON r BETWEEN min AND max

This is what I get:

Row     name    resultset.id    resultset.results

1       Test 2             1                   23
                                               24
                           2                   25
                                               26

2       Test 2             1                   23
                                               24
                           2                   25
                                               26

3       Test 2             1                   23
                                               24
                           2                   25
                                               26

4       Test 3             1                   26
                                               27
                           2                   28
                                               29

5       Test 3             1                   26
                                               27
                           2                   28
                                               29

What I would like is to call DISTINCT on data.* in the SELECT, boil this down to the two unique rows, and be done with it.

In other words, this is what I want:

Row     name    resultset.id    resultset.results

1       Test 2             1                   23
                                               24
                           2                   25
                                               26

2       Test 3             1                   26
                                               27
                           2                   28
                                               29

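But I can't do that with nested data.

So, I have two questions:

  1. How do I collapse the identical rows in this situation?
  2. Have I led myself down the wrong path, and is there a better way to achieve this?

About the data: I can't change the first table. I can play with the second table if that leads to a simpler solution.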

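2 answers:

Answer 0: (score: 1)

Below is for BigQuery Standard SQL

The simplest solution (without changing the core of the query you already have) is to add a GROUP BY as shown below: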

#standardSQL
SELECT ANY_VALUE(data).*
FROM data, UNNEST(resultset), UNNEST(results) r
JOIN ranges ON r BETWEEN min AND max
GROUP BY TO_JSON_STRING(data)    
  

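This works! But I don't understand why. Can you elaborate?

Sure.

SELECT DISTINCT ... FROM ... is conceptually equivalent to SELECT ... GROUP BY

So the task is to find a suitable value to GROUP BY and a corresponding aggregation function (which GROUP BY requires).

ANY_VALUE and TO_JSON_STRING(data) are what we need here: TO_JSON_STRING(data) serializes the whole row, nested arrays included, into a single string that can be grouped on, and ANY_VALUE(data) picks one representative row out of each group of identical rows.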
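
To make the collapsing visible, here is a self-contained sketch of the same GROUP BY approach using the sample tables from the question. The explicit aliases (d, rs, r, g) and the extra matching_pairs column are assumptions added for illustration only; they are not part of the original answer.

```sql
#standardSQL
WITH data AS (
  SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
  UNION ALL
  SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
  UNION ALL
  SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
)
SELECT
  -- One representative copy of each distinct input row.
  ANY_VALUE(d).*,
  -- Hypothetical extra column: how many (value, range) pairs were folded into this row.
  COUNT(*) AS matching_pairs
FROM data AS d
CROSS JOIN UNNEST(d.resultset) AS rs
CROSS JOIN UNNEST(rs.results) AS r
JOIN ranges AS g ON r BETWEEN g.min AND g.max
-- Identical rows serialize to the same JSON string, so they land in one group.
GROUP BY TO_JSON_STRING(d)
```

With the sample data, matching_pairs should come out as 3 for Test 2 and 2 for Test 3, which lines up with the three and two duplicate rows shown in the question.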

Answer 1: (score: 0)

Try selecting just the data you need from the data set. This query returns unique but un-nested results:

SELECT data.name, rs.id, r
FROM data
LEFT JOIN UNNEST(data.resultset) AS rs
LEFT JOIN UNNEST(rs.results) AS r
JOIN ranges ON r BETWEEN min AND max
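
If the nested shape has to be kept, one possible alternative is to avoid generating duplicates in the first place by filtering with EXISTS instead of joining. This is only a sketch and does not come from either answer; the range_list CTE and the aliases d, rl, rs, r, b, lo and hi are hypothetical names introduced here.

```sql
#standardSQL
WITH data AS (
  SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
  UNION ALL
  SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
  UNION ALL
  SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
),
range_list AS (
  -- Pack every range into a single array; this CTE always yields exactly one row.
  SELECT ARRAY_AGG(STRUCT(ranges.min AS lo, ranges.max AS hi)) AS bounds
  FROM ranges
)
SELECT d.*
FROM data AS d
CROSS JOIN range_list AS rl
WHERE EXISTS (
  -- True as soon as any nested value falls inside any range.
  SELECT 1
  FROM UNNEST(d.resultset) AS rs
  CROSS JOIN UNNEST(rs.results) AS r
  CROSS JOIN UNNEST(rl.bounds) AS b
  WHERE r BETWEEN b.lo AND b.hi
)
```

Because each row of data is tested once and either kept or dropped, the nested columns come back untouched and no DISTINCT or GROUP BY is needed. Packing the ranges into an array is a design choice that keeps the correlated subquery limited to unnesting arrays rather than joining a table inside it.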