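I have a table that contains id as a single (scalar) column plus several nested (repeated) columns.
1)
Example schema for better understanding:
id - string,
childrenNames - repeated string,
animalNames - repeated string,
Another table contains only single (non-repeated) columns
2)
Example schema for better understanding:
childrenName - string,
animalName - string
I need to find all records in table 2) that are not in table 1). That is, for a record to count as present, its childrenName and animalName must both belong to the same user.
I should add that I tried checking each column of table 2) separately against an 'IN' list of values from table 1), but if that returns rows it can still mean that the two values belong to two (or more) different ids.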
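A minimal sketch of that pitfall, assuming a hypothetical second id 5678 and placeholder CTE names table1 and table2 (neither comes from the real data). The pair ('Ana', 'Ozzy') passes both separate IN checks, although no single id owns both names:

```sql
#standardSQL
-- Hypothetical data: 'Ana' belongs to id 1234, 'Ozzy' belongs to id 5678.
WITH table1 AS (
  SELECT '1234' id, ['Ana'] childrenNames, ['Rex'] animalNames UNION ALL
  SELECT '5678', ['Frank'], ['Ozzy']
), table2 AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName
)
SELECT childrenName, animalName
FROM table2
-- Each check looks at ALL ids at once, so the two matches
-- may come from different rows of table1.
WHERE childrenName IN (SELECT cn FROM table1, UNNEST(childrenNames) cn)
  AND animalName IN (SELECT an FROM table1, UNNEST(animalNames) an)
```

The row is returned, so this check alone cannot tell whether both values belong to one user.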
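Example row in table 1)
id: 1234,
childrenNames: ['Ana', 'Frank'],
animalNames: ['Rex', 'Max'],
Example rows in table 2)
A)
childrenName: 'Ana',
animalName: 'Ozzy'
B)
childrenName: 'Frank',
animalName: 'Rex'
For the example above, I should get row A) from table 2), because 'Ozzy' does not belong to id 1234 (assuming table 1) has no more records).
Does anyone know how to solve this kind of problem with BigQuery SQL (standard or legacy)?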
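Answer 0 (score: 1)
Below is for BigQuery Standard SQL. It returns the pairs from table 2) that do belong to a single user, together with the ids of those users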
#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName
You can test and play with the above using the sample data from the question:
#standardSQL
WITH `project.dataset.table1` AS (
SELECT '1' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames
), `project.dataset.table2` AS (
SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
SELECT 'Frank', 'Rex'
)
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName
with result
Row childrenName animalName users
1 Frank Rex 1
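Note: the users field in the output is a repeated string (array) holding the ids of the users that have the searched pair
A less verbose variant of the above would be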
#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE childrenName IN UNNEST(childrenNames)
AND animalName IN UNNEST(animalNames)
GROUP BY childrenName, animalName
with exactly the same result
So, obviously, use the second one :o)
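To make the shape of the users array more concrete, here is a small sketch with a hypothetical second user (id '2' is an assumption added for illustration) who also has the pair Frank / Rex. The userCount column is likewise an addition, not part of the original query:

```sql
#standardSQL
WITH `project.dataset.table1` AS (
  SELECT '1' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames UNION ALL
  -- Hypothetical second user sharing the same pair
  SELECT '2', ['Frank'], ['Rex']
), `project.dataset.table2` AS (
  SELECT 'Frank' childrenName, 'Rex' animalName
)
SELECT childrenName, animalName,
  ARRAY_AGG(DISTINCT id ORDER BY id) users,  -- every id that owns both names
  COUNT(DISTINCT id) userCount               -- how many such ids there are
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE childrenName IN UNNEST(childrenNames)
AND animalName IN UNNEST(animalNames)
GROUP BY childrenName, animalName
```

Here users should contain both ids, since each of the two rows of table 1) holds 'Frank' and 'Rex' together.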
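... table 1) has 5 million records and table 2) has 200k, so
Query exceeded resource limits
Try the version below, which flattens table 1) first and then uses an equality join instead of a CROSS JOIN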
#standardSQL
WITH flatten_table1 AS (
SELECT id, childrenName, animalName
FROM `project.dataset.table1`,
UNNEST(childrenNames) childrenName,
UNNEST(animalNames) animalName
)
SELECT childrenName, animalName, id
FROM `project.dataset.table2`
JOIN flatten_table1
USING(childrenName, animalName)
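The queries above return the pairs that do belong to some user, while the question asks for the rows of table 2) that do not. One possible sketch of that inverse, reusing the same flattening idea with a LEFT JOIN and keeping only the unmatched rows. The aliases t2 and f are assumptions, and the sample data is the one from the question:

```sql
#standardSQL
WITH `project.dataset.table1` AS (
  SELECT '1234' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames
), `project.dataset.table2` AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
  SELECT 'Frank', 'Rex'
), flatten_table1 AS (
  -- One row per (child, animal) combination owned by a single id;
  -- DISTINCT keeps the join input small when several ids share a pair.
  SELECT DISTINCT childrenName, animalName
  FROM `project.dataset.table1`,
  UNNEST(childrenNames) childrenName,
  UNNEST(animalNames) animalName
)
SELECT t2.childrenName, t2.animalName
FROM `project.dataset.table2` t2
LEFT JOIN flatten_table1 f
  ON t2.childrenName = f.childrenName
 AND t2.animalName = f.animalName
-- No matching flattened row means no single id owns both names
WHERE f.childrenName IS NULL
```

With the sample data this keeps only the Ana / Ozzy row, which is row A) from the question. Note that rows of table 2) with a NULL in either column never satisfy the equality join, so they are also reported as unmatched.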