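I am trying to use the new standard SQL to find rows in a Google BigQuery table that contain duplicate fields within an array of structs. The data in the table (simplified) looks something like this for each row: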
{
"Session": "abc123",
"Information": [
{
"Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
},
{
"Identifier": "1c62813f-7ec4-4968-b18b-d1eb8f4d9d26"
},
{
"Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
}
]
}
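My end goal is to show the rows whose Information entities contain duplicate Identifier values. However, most of the queries I have tried fail with an error message of the following form: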
Cannot access field Identifier on a value with type ARRAY<STRUCT<Identifier STRING>>
Is there a way to work with data inside a STRUCT within an ARRAY?
Here is the first query I tried:
SELECT
Session,
Information
FROM
`events.myevents`
WHERE
COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
LIMIT
1000
And another one, using a subquery:
SELECT
Session,
Information
FROM (
SELECT
Session,
Information,
COUNT(DISTINCT Information.Identifier) AS info_count_distinct,
ARRAY_LENGTH(Information) AS info_count
FROM
`events.myevents`
WHERE
COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
LIMIT
1000)
WHERE
info_count != info_count_distinct
Answer 0 (score: 4)
Try the following:
SELECT Session, Identifier, COUNT(1) AS dups
FROM `events.myevents`, UNNEST(Information)
GROUP BY Session, Identifier
HAVING dups > 1
ORDER BY Session
This should give you what you expect, plus the number of duplicates, as shown below (example):
Session Identifier dups
abc123 e8d971a4-ef33-4ea1-8627-f1213e4c67dc 2
abc345 1c62813f-7ec4-4968-b18b-d1eb8f4d9d26 3
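If you would rather get back the whole row (Session plus the full Information array) than one row per duplicated Identifier, one possible variant is to compare the distinct count against the array length inside a correlated subquery over UNNEST. This is only a sketch: the inline myevents data, the def456 session and the info alias are made up for illustration and stand in for `events.myevents`. It also assumes Identifier is never NULL, because COUNT(DISTINCT ...) skips NULLs and such a row would then look like it has a duplicate.

```sql
-- Hypothetical inline sample standing in for `events.myevents`.
WITH myevents AS (
  SELECT
    'abc123' AS Session,
    [STRUCT('e8d971a4-ef33-4ea1-8627-f1213e4c67dc' AS Identifier),
     STRUCT('1c62813f-7ec4-4968-b18b-d1eb8f4d9d26' AS Identifier),
     STRUCT('e8d971a4-ef33-4ea1-8627-f1213e4c67dc' AS Identifier)] AS Information
  UNION ALL
  SELECT
    'def456' AS Session,
    [STRUCT('aaaa' AS Identifier),
     STRUCT('bbbb' AS Identifier)] AS Information
)
SELECT
  Session,
  Information
FROM
  myevents
WHERE
  -- The scalar subquery unnests this row's array, so Identifier is read
  -- from individual STRUCT elements rather than from the ARRAY itself.
  (SELECT COUNT(DISTINCT info.Identifier)
   FROM UNNEST(Information) AS info) != ARRAY_LENGTH(Information)
LIMIT
  1000
```

With the sample data above, only the abc123 row should come back, since its array has three elements but only two distinct Identifier values. To run it against the real table, drop the WITH clause and select from `events.myevents` instead.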