In BigQuery, how do I filter an array of STRUCTs using Standard SQL when matching multiple fields within the same STRUCT?

Asked: 2017-06-28 21:11:29

Tags: google-bigquery

This is the record layout of the table (load_history) that I am trying to filter using Standard SQL (since Legacy SQL may be deprecated at some point):

[
{
    "mode": "NULLABLE",
    "name": "Job",
    "type": "RECORD",
    "fields": [
        {
          "mode": "NULLABLE",
          "name": "name",
          "type": "STRING"
        },
        {
          "mode": "NULLABLE",
          "name": "start_time",
          "type": "TIMESTAMP"
        },
        {
          "mode": "NULLABLE",
          "name": "end_time",
          "type": "TIMESTAMP"
        }
    ]
},      
{
    "mode": "REPEATED",
    "name": "source",
    "type": "RECORD",
    "description": "source tables touched by this job",
    "fields": [     
        {
          "mode": "NULLABLE",
          "name": "database",
          "type": "STRING"
        },
        {
          "mode": "NULLABLE",
          "name": "schema",
          "type": "STRING"
        },
        {
          "mode": "NULLABLE",
          "name": "table",
          "type": "STRING"
        },
        {
          "mode": "NULLABLE",
          "name": "partition_time",
          "type": "TIMESTAMP"
        }    
    ]
}
]      

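I need to filter and select only the records that have an entry in the "source" array whose "schema" and "table" fields match certain values (for example, schema = 'log' and table = 'customer' in the same array entry).

The following works only when filtering on a single field of the STRUCT (the schema name):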

select name, array(select x from unnest(schema) as x where x ='log' ), table
from (select job.name , array(select schema from unnest(source)) as schema, 
      array(select table from unnest(source)) as table
      from  config.load_history)

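However, I have not been able to filter on a combination of fields within the same array entry.

Any help is much appreciated.

4 Answers:

Answer 0 (score: 4)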

For BigQuery Standard SQL, either of the following works:

  
#standardSQL
SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')  

#standardSQL
SELECT *
FROM data
WHERE EXISTS (
  SELECT 1 FROM UNNEST(source) AS s 
  WHERE (s.schema, s.table) = ('log', 'customer')
)

You can test and play with it using the dummy data below:

#standardSQL
WITH data AS (
  SELECT 
    STRUCT<name STRING, start_time INT64, end_time INT64>('jobA', 1, 2) AS job,
    [STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
      ('d1', 's1', 't1', 1), 
      ('d1', 's2', 't2', 2), 
      ('d1', 's3', 't3', 3) 
    ] AS source UNION ALL
  SELECT 
    STRUCT<name STRING, start_time INT64, end_time INT64>('jobB', 1, 2) AS job,
    [STRUCT<database STRING, schema STRING, table STRING, partition_time INT64>
      ('d1', 's1', 't1', 1), 
      ('d2', 's4', 't2', 2), 
      ('d2', 's3', 't3', 3) 
    ] AS source 
)
SELECT *
FROM data
WHERE EXISTS (
  SELECT 1 FROM UNNEST(source) AS s 
  WHERE (s.schema, s.table) = ('s2', 't2')
)
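
One difference between the two queries above may matter in practice: the comma join with UNNEST produces one output row per matching array element, while the EXISTS form keeps each table row at most once. The following is a minimal sketch of that difference, assuming a hypothetical one-row table in which 'jobC' has two source entries matching ('log', 'customer'); the job name and values are made up for illustration.

```sql
#standardSQL
-- Hypothetical data: one job whose source array has two matching entries.
WITH data AS (
  SELECT
    STRUCT('jobC' AS name) AS job,
    [STRUCT('d1' AS database, 'log' AS schema, 'customer' AS table),
     STRUCT('d2' AS database, 'log' AS schema, 'customer' AS table),
     STRUCT('d1' AS database, 'log' AS schema, 'orders' AS table)] AS source
)
SELECT
  -- Join form: counts one row per matching array element.
  (SELECT COUNT(*)
   FROM data, UNNEST(source) AS s
   WHERE (s.schema, s.table) = ('log', 'customer')) AS rows_from_join,
  -- EXISTS form: counts each table row at most once.
  (SELECT COUNT(*)
   FROM data
   WHERE EXISTS (
     SELECT 1 FROM UNNEST(source) AS s
     WHERE (s.schema, s.table) = ('log', 'customer'))) AS rows_from_exists
```

With this data, rows_from_join should come out as 2 and rows_from_exists as 1, so the EXISTS form is probably the better fit when each job should appear only once.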

Answer 1 (score: 1)

It sounds like you want something like this:

SELECT
  job.name,
  ARRAY(SELECT schema FROM UNNEST(matching_sources)) AS matching_schemas,
  ARRAY(SELECT table FROM UNNEST(matching_sources)) AS matching_tables
FROM (
  SELECT *,
    ARRAY(SELECT AS STRUCT * FROM UNNEST(source)
          WHERE schema = 'log' AND `table` = 'customer') AS matching_sources
  FROM YourTable
)
WHERE ARRAY_LENGTH(matching_sources) > 0;

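This returns an array of schemas and an array of tables, both taken from the array entries that match the condition, and it excludes rows in which no array entry matches.

Answer 2 (score: 0)

"I need to filter and select only the records that have an entry in the 'source' array whose 'schema' and 'table' fields match certain values"

This sounds like it can be solved with a simple WHERE clause, like so: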

WITH data AS(
  select STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP> ('job_1', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job, ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [STRUCT('database_1', "schema_1", "table_1", TIMESTAMP("2017-06-10")), STRUCT('database_1', "schema_1", "table_2", TIMESTAMP("2017-06-10")), STRUCT('database_1', "schema_3", "table_1", TIMESTAMP("2017-06-10")), STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10"))] source union all
  select STRUCT<name STRING, start_time TIMESTAMP, end_time TIMESTAMP> ('job_2', TIMESTAMP("2017-06-10"), TIMESTAMP("2017-06-11")) Job, ARRAY<STRUCT<database STRING, schema STRING, table STRING, partition_time TIMESTAMP> > [STRUCT('database_2', "schema_2", "table_2", TIMESTAMP("2017-06-10")), STRUCT('database_2', "schema_2", "table_3", TIMESTAMP("2017-06-10")), STRUCT('database_1', "schema_1", "table_3", TIMESTAMP("2017-06-10"))] source
)

SELECT
  *
FROM data
WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")

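This returns every row that has, in at least one array entry, both the given schema and the given table.

If you also want the output to contain only the array entries that match the filter, you can run this: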

SELECT
  job.*,
  ARRAY(SELECT AS STRUCT database, schema, table, partition_time FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2") filtered_data
FROM data
  WHERE EXISTS(SELECT 1 FROM UNNEST(source) WHERE schema = "schema_2" AND table = "table_2")

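I am not sure whether this matches your question exactly, but it may give you an idea of how to filter values out of an ARRAY.

Answer 3 (score: 0)

Mikhail Berlyant (https://stackoverflow.com/users/5221944/mikhail-berlyant) explained this very well. I used his first example: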

SELECT data
FROM data, UNNEST(source) AS s
WHERE (s.schema, s.table) = ('log', 'customer')  

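Let me explain with my own example: suppose I want to get the rows from the Google public patents dataset that exactly match a specific CPC classification code.

Normally, I would use a LIKE condition: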

SELECT cpc
FROM
`patents-public-data.patents.publications`
where cpc like "%G01R31/007"

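I cannot use that here, because each cpc cell contains an array of records, for example [{'code': 'G01R31/007', 'inventive': True, 'first': False, 'tree': []}].

So I need to unnest this array into its individual elements, address the code field of each element, and compare it for equality with the exact value I want to extract, in this case G01R31/007.

The code is as follows: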

SELECT publication_number, cpc
FROM `patents-public-data.patents.publications`, 
UNNEST(cpc) AS s
WHERE (s.code) = ('G01R31/007')
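
If a pattern match is still wanted rather than an exact match, the LIKE condition can be moved inside the array instead of being applied to the array column itself. The following is a minimal sketch, assuming (as described above) that cpc is a repeated record with a STRING field named code; the prefix 'G01R31/%' and the LIMIT are illustrative choices, not taken from the original query.

```sql
#standardSQL
SELECT publication_number, cpc
FROM `patents-public-data.patents.publications`
WHERE EXISTS (
  -- Test each array element; keep the row if any code matches the pattern.
  SELECT 1
  FROM UNNEST(cpc) AS s
  WHERE s.code LIKE 'G01R31/%'  -- illustrative prefix pattern
)
LIMIT 10
```

Because this uses EXISTS rather than a join against UNNEST(cpc), each publication should appear at most once even when several of its codes match. Note that this public table is large, so a query like this may scan a considerable amount of data.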