Is self-joining the way to go in BigQuery when fetching data from multiple repeated fields?

Date: 2014-10-22 02:23:41

Tags: google-bigquery

Consider this schema:

key: REQUIRED INTEGER
description: NULLABLE STRING
field: REPEATED RECORD {
    field.names: REQUIRED STRING
    field.value: NULLABLE FLOAT
}

Where: key is unique per table, and field.names is actually a comma-separated list of properties ("property1","property2","property3"...).

Sample dataset (never mind the actual values; they only demonstrate the structure):

{"key":1,"description":"Cool","field":[{"names":"\"Nice\",\"Wonderful\",\"Woohoo\"", "value":1.2},{"names":"\"Everything\",\"is\",\"Awesome\"", "value":20}]}
{"key":2,"description":"Stack","field":[{"names":"\"Overflow\",\"Exchange\",\"Nice\"", "value":2.0}]}
{"key":3,"description":"Iron","field":[{"names":"\"The\",\"Trooper\"", "value":666},{"names":"\"Aces\",\"High\",\"Awesome\"", "value":333}]}

What I need is a way to query the values for multiple field.names at once. The output should look like this:

+-----+--------+-------+-------+-------+-------+
| key |  desc  | prop1 | prop2 | prop3 | prop4 |
+-----+--------+-------+-------+-------+-------+
| 1   | Desc 1 | 1.0   | 2.0   | 3.0   | 4.0   |
| 2   | Desc 2 | 4.0   | 3.0   | 2.0   | 1.0   |
| ... |        |       |       |       |       |
+-----+--------+-------+-------+-------+-------+

If the same key contains more than one field matching the same queried name, only the first value should be considered.

This is my query so far:

select all.key as key, all.description as desc, 
t1.col as prop1, t2.col as prop2, t3.col as prop3 //and so on...

from mydataset.mytable all

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"trooper"'
group each by key, col
) as t1 on all.key = t1.key

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"awesome"'
group each by key, col
) as t2 on all.key = t2.key

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"nice"'
group each by key, col
) as t3 on all.key = t3.key

//and so on...

The output of this query is:

+-----+-------+-------+-------+-------+
| key | desc  | prop1 | prop2 | prop3 |
+-----+-------+-------+-------+-------+
|   1 | Cool  | null  | 20.0  | 1.2   |
|   2 | Stack | null  | null  | 2.0   |
|   3 | Iron  | 666.0 | 333.0 | null  |
+-----+-------+-------+-------+-------+

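So my question is: is this the way to go? If my users want, say, 200 properties from my table, should I just do 200 self-joins? Is that scalable, considering the table can grow to billions of rows? Is there another way to do this in BigQuery?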

Thanks.

1 Answer:

Answer 0: (score: 7)

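Generally, queries with more than 50 joins can start to become problematic, especially when you are joining large tables. Even with repeated fields, you want to scan the table in a single pass wherever possible.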

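It is useful to note that when you query a table with a repeated field, you are really querying a semi-flattened representation of that table. You can pretend that each repetition is its own row, and apply filters, expressions, and grouping accordingly.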
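To make the "each repetition is its own row" idea concrete, here is a minimal Python sketch (not BigQuery code) that flattens the sample records from the question. The names `RECORDS` and `semi_flatten` are made up for illustration, and records with no repetitions are simply skipped here, which may differ from what BigQuery does.

```python
# Sample records from the question, written as Python dicts.
RECORDS = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
    {"key": 3, "description": "Iron", "field": [
        {"names": '"The","Trooper"', "value": 666.0},
        {"names": '"Aces","High","Awesome"', "value": 333.0},
    ]},
]


def semi_flatten(records):
    """Yield one flat row per repetition of the repeated `field` record."""
    for rec in records:
        for rep in rec["field"]:
            yield {
                "key": rec["key"],
                "description": rec["description"],
                "field.names": rep["names"],
                "field.value": rep["value"],
            }


rows = list(semi_flatten(RECORDS))

# A filter now works row by row, much like a WHERE clause on the repeated field.
awesome = [(r["key"], r["field.value"]) for r in rows
           if '"awesome"' in r["field.names"].lower()]
```

The three records become five flat rows, and the filter keeps only the rows whose names list mentions "awesome".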

In this case, I think you can get away with a single scan:

select
  key,
  description as desc,
  max(if(lower(field.names) contains "trooper", field.value, null))
      within record as prop1,
  max(if(lower(field.names) contains "awesome", field.value, null))
      within record as prop2,
  ...
from mydataset.mytable

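Here, each "prop" field selects only the value corresponding to the desired field name, or null if it is not present, and then aggregates those results with the "max" function. I'm assuming each field name occurs only once per key, in which case the specific aggregate function doesn't matter much, since it is only there to collapse the nulls. But obviously you should swap it for something more appropriate if needed.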
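The choice of aggregate only matters when a name matches more than one repetition under the same key, which is the case the question wants resolved as "first value wins". A small Python sketch (not BigQuery code; `pick` is a hypothetical helper and the values are invented) shows where "max" and "first" diverge:

```python
def pick(values, how):
    """Collapse per-repetition values (None means no match) into one value."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    if how == "max":
        return max(present)
    if how == "first":
        return present[0]
    raise ValueError("unknown aggregate: " + how)


# Hypothetical record in which the name matches two repetitions.
two_matches = [1.2, None, 5.0]
by_max = pick(two_matches, "max")
by_first = pick(two_matches, "first")

# With a single match, both aggregates agree: they only collapse the nulls.
one_match = [None, 7.0]
same = pick(one_match, "max") == pick(one_match, "first")
```

With two matches the results differ (the largest value versus the first one), so the aggregate should follow whatever rule the data actually needs.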

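The "within record" syntax tells BigQuery to perform those aggregations only over the repeated fields within each record, not across the entire table, which eliminates the need for a "group by" clause at the end.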
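As a rough illustration of per-record aggregation, the following Python sketch (a simulation, not BigQuery code) visits each sample record once and computes every "prop" column from that record's own repetitions. `WANTED` and `pivot_record` are hypothetical names, and plain substring matching stands in for "contains".

```python
# Sample records from the question, written as Python dicts.
RECORDS = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
    {"key": 3, "description": "Iron", "field": [
        {"names": '"The","Trooper"', "value": 666.0},
        {"names": '"Aces","High","Awesome"', "value": 333.0},
    ]},
]

# Output column -> property name to look for (assumed mapping).
WANTED = {"prop1": "trooper", "prop2": "awesome", "prop3": "nice"}


def pivot_record(rec, wanted):
    """Aggregate over one record's repetitions only; no cross-record grouping."""
    out = {"key": rec["key"], "desc": rec["description"]}
    for col, needle in wanted.items():
        hits = [rep["value"] for rep in rec["field"]
                if needle in rep["names"].lower() and rep["value"] is not None]
        out[col] = max(hits) if hits else None
    return out


# One pass over the records, one output row per record.
result = [pivot_record(rec, WANTED) for rec in RECORDS]
```

On the sample data this yields the same values as the self-join output in the question, and adding another property only adds an entry to `WANTED` rather than another pass over the data.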