Consider this schema:
key: REQUIRED INTEGER
description: NULLABLE STRING
field: REPEATED RECORD {
field.names: REQUIRED STRING
field.value: NULLABLE FLOAT
}
Here, key is unique per table, and field.names is actually a comma-separated list of properties ("property1","property2","property3"...).
Sample dataset (ignore the actual values; they are only there to demonstrate the structure):
{"key":1,"description":"Cool","field":[{"names":"\"Nice\",\"Wonderful\",\"Woohoo\"", "value":1.2},{"names":"\"Everything\",\"is\",\"Awesome\"", "value":20}]}
{"key":2,"description":"Stack","field":[{"names":"\"Overflow\",\"Exchange\",\"Nice\"", "value":2.0}]}
{"key":3,"description":"Iron","field":[{"names":"\"The\",\"Trooper\"", "value":666},{"names":"\"Aces\",\"High\",\"Awesome\"", "value":333}]}
What I need is a way to query the values for several field.names at once. The output should look like this:
+-----+--------+-------+-------+-------+-------+
| key | desc | prop1 | prop2 | prop3 | prop4 |
+-----+--------+-------+-------+-------+-------+
| 1 | Desc 1 | 1.0 | 2.0 | 3.0 | 4.0 |
| 2 | Desc 2 | 4.0 | 3.0 | 2.0 | 1.0 |
| ... | | | | | |
+-----+--------+-------+-------+-------+-------+
If the same key contains more than one field with the queried name, only the first value should be considered.
This is my query so far:
select all.key as key, all.description as desc,
t1.col as prop1, t2.col as prop2, t3.col as prop3 //and so on...
from mydataset.mytable all
left join each
(select key, field.value as col from
mydataset.mytable
where lower(field.names) contains '"trooper"'
group each by key, col
) as t1 on all.key = t1.key
left join each
(select key, field.value as col from
mydataset.mytable
where lower(field.names) contains '"awesome"'
group each by key, col
) as t2 on all.key = t2.key
left join each
(select key, field.value as col from
mydataset.mytable
where lower(field.names) contains '"nice"'
group each by key, col
) as t3 on all.key = t3.key
//and so on...
The output of this query is:
+-----+-------+-------+-------+-------+
| key | desc | prop1 | prop2 | prop3 |
+-----+-------+-------+-------+-------+
| 1 | Cool | null | 20.0 | 1.2 |
| 2 | Stack | null | null | 2.0 |
| 3 | Iron | 666.0 | 333.0 | null |
+-----+-------+-------+-------+-------+
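So my question is: is this the way to go? If my users want, say, 200 properties from my table, should I just do 200 self-joins? Is that scalable, considering the table can grow to billions of rows? Is there another way to do this in BigQuery?
Thanks.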
Answer 0 (score: 7)
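In general, queries with more than 50 joins can start to run into problems, especially when you are joining large tables. Even with repeated fields, you want to scan the table only once whenever possible.
It is useful to note that when you query a table with repeated fields, you are really querying a semi-flattened representation of that table. You can pretend that each repetition is its own row, and apply filters, expressions, and grouping accordingly.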
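To make the "semi-flattened" idea concrete, here is a small Python model of it. This is only a sketch of the mental model, not how BigQuery is implemented; `semi_flatten` and the in-memory `records` list are hypothetical names built from the question's sample data.

```python
# Hypothetical in-memory copy of two sample records from the question.
records = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
]


def semi_flatten(records):
    """Yield one flat row per repetition of the repeated `field` record."""
    for rec in records:
        for rep in rec["field"]:
            yield {
                "key": rec["key"],
                "description": rec["description"],
                "field.names": rep["names"],
                "field.value": rep["value"],
            }


rows = list(semi_flatten(records))

# A filter on a repeated leaf now reads like an ordinary row filter.
nice_rows = [r for r in rows if '"nice"' in r["field.names"].lower()]
```

Key 1 contributes two rows and key 2 contributes one, so a filter or expression on field.names is evaluated once per repetition rather than once per record.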
In this case, I think you can get away with a single scan:
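select
key,
description as desc,
// match the quoted name, as the question does, so "trooper" does not also hit e.g. "troopers"
max(if(lower(field.names) contains '"trooper"', field.value, null))
within record as prop1,
max(if(lower(field.names) contains '"awesome"', field.value, null))
within record as prop2,
...
from mydataset.mytable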
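Here, each "prop" field selects only the value that corresponds to the desired field name, or null when that name is absent, and then aggregates those results with the "max" function. I am assuming that each field name occurs only once per key, in which case the specific aggregate function does not matter, since it is only there to collapse the nulls. But obviously you should swap it for something more appropriate if you need to.
The "within record" syntax tells BigQuery to perform these aggregations only over the repeated fields within a record, rather than across the whole table, which eliminates the need for a "group by" clause at the end.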
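If it helps to see what the per-record aggregation computes, the following Python sketch imitates it on the question's sample data. It is an illustration under assumptions, not BigQuery code: `max_within_record`, `first_within_record`, and `pivot` are hypothetical helpers, and the names are matched in their quoted form as in the question's own query.

```python
# Hypothetical in-memory copy of the question's sample records.
records = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
    {"key": 3, "description": "Iron", "field": [
        {"names": '"The","Trooper"', "value": 666.0},
        {"names": '"Aces","High","Awesome"', "value": 333.0},
    ]},
]


def max_within_record(record, needle):
    """Largest non-null value among repetitions whose names contain needle."""
    matches = [
        rep["value"]
        for rep in record["field"]
        if needle in rep["names"].lower() and rep["value"] is not None
    ]
    return max(matches) if matches else None


def first_within_record(record, needle):
    """Value of the first repetition whose names contain needle, else None."""
    for rep in record["field"]:
        if needle in rep["names"].lower():
            return rep["value"]
    return None


def pivot(records, wanted, pick=max_within_record):
    """One output row per record, one column per wanted name, in a single pass."""
    out = []
    for rec in records:
        row = {"key": rec["key"], "desc": rec["description"]}
        for column, needle in wanted.items():
            row[column] = pick(rec, needle)
        out.append(row)
    return out


wanted = {"prop1": '"trooper"', "prop2": '"awesome"', "prop3": '"nice"'}
result = pivot(records, wanted)
```

Each record is visited once and collapses to a single output row, which is why no join or final grouping is needed. Passing `pick=first_within_record` models the "only the first value" requirement from the question, for the case where a name can occur more than once per key.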