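Dear all,
This month I started using BigQuery to analyze data in the GAE Datastore. First, I exported the data to Google Cloud Storage through the "Datastore Admin" page of the GAE console. Then I imported the data from Google Cloud Storage into BigQuery. This works smoothly except for repeated structured properties. I expected the imported records to have this format: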
parent: "James",
children: [{
    name: "name1",
    age: 5,
    gender: "M"
  }, {
    name: "name2",
    age: 50,
    gender: "F"
  }, {
    name: "name3",
    age: 33,
    gender: "M"
  },
]
I know how to flatten data in the format above. However, the actual data in BigQuery appears to have the following format:
parent: "James",
children.name:["name1", "name2", "name3"],
children.age:[5, 50, 33],
children.gender:["M", "F", "M"],
I would like to know whether the data above can be flattened in BigQuery for further analysis. The ideal result table would be:
parentName, children.name, children.age, children.gender
James, name1, 5, "M"
James, name2, 50, "F"
James, name3, 33, "M"
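To make the target transformation concrete outside BigQuery, here is a minimal Python sketch that pairs up the parallel arrays by position and emits one row per child. The helper name `flatten_record` and the dictionary keys are assumptions chosen to mirror the field names above; this only illustrates the desired result, not how BigQuery stores or processes the data.

```python
from typing import Any, Dict, List, Tuple


def flatten_record(record: Dict[str, Any]) -> List[Tuple[str, str, int, str]]:
    """Turn one parent record with parallel child arrays into one row per child."""
    parent = record["parent"]
    names = record["children.name"]
    ages = record["children.age"]
    genders = record["children.gender"]
    # zip pairs up the elements that share the same position in each array.
    return [(parent, name, age, gender) for name, age, gender in zip(names, ages, genders)]


# Hypothetical record shaped like the imported data shown above.
record = {
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}

rows = flatten_record(record)
for row in rows:
    print(row)
```

Note that `zip` stops at the shortest array, so a record whose arrays differ in length would silently lose values in this sketch.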
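Cheers!
Answer 0 (score: 3)
BigQuery Standard SQL was introduced recently, and things are much nicer now!
Try the following (make sure to uncheck the Use Legacy SQL checkbox under Show Options):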
WITH parents AS (
  SELECT
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children
)
SELECT
  parentName, childrenName, childrenAge, childrenGender
FROM
  parents,
  UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
  UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
  UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
  pos_name = pos_age AND pos_name = pos_gender
Here the original table, parents, has the following data and structure:
[{
  "parentName": "James",
  "children": {
    "name": ["name1", "name2", "name3"],
    "age": ["5", "50", "33"],
    "gender": ["M", "F", "M"]
  }
}]
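and the output is
parentName, childrenName, childrenAge, childrenGender
James, name1, 5, M
James, name2, 50, F
James, name3, 33, M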
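Note: the above is based purely on what I see in the original question and will most likely need to be adjusted to your specific needs. I hope it points you in the right direction and gives you a place to start!
Added:
The query above uses row-based CROSS JOINs, which means that all combinations of array elements for the same parent are assembled first, and only then does the WHERE clause filter out the "wrong" ones.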
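The effect of cross joining first and filtering afterwards can be illustrated with a small Python sketch. It models each UNNEST ... WITH OFFSET as `enumerate` and the comma joins as `itertools.product`; this is only a conceptual model under those assumptions, not a description of how BigQuery actually executes the query.

```python
from itertools import product

names = ["name1", "name2", "name3"]
ages = [5, 50, 33]
genders = ["M", "F", "M"]

# Model of the three comma-joined UNNEST ... WITH OFFSET clauses:
# every (position, value) pair of one array is combined with every pair of the others.
combinations = list(product(enumerate(names), enumerate(ages), enumerate(genders)))

# Model of WHERE pos_name = pos_age AND pos_name = pos_gender.
cross_join_rows = [
    ("James", name, age, gender)
    for (pos_name, name), (pos_age, age), (pos_gender, gender) in combinations
    if pos_name == pos_age and pos_name == pos_gender
]

print(len(combinations))  # number of candidate rows built before filtering
print(cross_join_rows)
```

For three children, 27 candidate rows are built and only 3 survive the filter; the number of candidates grows with the cube of the array length.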
In contrast, the version below uses INNER JOIN to eliminate this "side effect":
WITH parents AS (
  SELECT
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children
)
SELECT
  parentName, childrenName, childrenAge, childrenGender
FROM
  parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
  JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
    ON pos_name = pos_age
  JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
    ON pos_age = pos_gender
Intuitively, I would expect the second version to be somewhat more efficient for larger tables.
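One way to see why joining on position step by step could be cheaper is the following Python sketch, which matches elements by offset at each join instead of building every combination. It assumes a lookup-by-position strategy (similar in spirit to a hash join); whether BigQuery plans the query this way is not something the answer states, so treat it as an intuition aid only.

```python
names = ["name1", "name2", "name3"]
ages = [5, 50, 33]
genders = ["M", "F", "M"]

# Index the right-hand arrays by position, as a join keyed on the offset might.
age_by_pos = dict(enumerate(ages))
gender_by_pos = dict(enumerate(genders))

# First join: names with ages ON pos_name = pos_age.
name_age = [
    (pos, name, age_by_pos[pos])
    for pos, name in enumerate(names)
    if pos in age_by_pos
]

# Second join: the intermediate result with genders ON pos_age = pos_gender.
joined_rows = [
    ("James", name, age, gender_by_pos[pos])
    for pos, name, age in name_age
    if pos in gender_by_pos
]

print(len(name_age))  # size of the intermediate result after the first join
print(joined_rows)
```

Here the intermediate result never exceeds 3 rows, compared with the 27 combinations a full cross join of three 3-element arrays would produce.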
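Answer 1 (score: 1)
You should be able to use the "Large query results" feature to generate a new, flattened table. Unfortunately, the syntax is horrifying. The basic principle is that you flatten each field while saving its position, then keep only the rows where the positions match. Try something like: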
SELECT parentName, children.name, children.age, children.gender,
  position(children.name) as name_pos,
  position(children.age) as age_pos,
  position(children.gender) as gender_pos,
FROM table
SELECT
  parent,
  children.name,
  children.age,
  children.gender,
  pos
FROM (
  SELECT
    parent,
    children.name,
    children.age,
    children.gender,
    gender_pos,
    pos
  FROM (
    FLATTEN((
      SELECT
        parent,
        children.name,
        children.age,
        children.gender,
        pos,
        POSITION(children.gender) as gender_pos
      FROM (
        SELECT
          parent,
          children.name,
          children.age,
          children.gender,
          pos,
        FROM (
          FLATTEN((
            SELECT
              parent,
              children.name,
              children.age,
              children.gender,
              pos,
              POSITION(children.age) AS age_pos
            FROM (
              FLATTEN((
                SELECT
                  parent,
                  children.name,
                  children.age,
                  children.gender,
                  POSITION(children.name) AS pos
                FROM table
              ),
              children.name))),
          children.age))
        WHERE
          age_pos = pos)),
    children.gender)))
WHERE
  gender_pos = pos;
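To allow large results when using the BigQuery UI, click the "Advanced Options" button, specify a destination table, and check the "Allow Large Results" flag.
Note that if your data is stored as entities with nested records like {name, age, gender}, we should be converting them to nested records in BigQuery rather than to parallel arrays. I will look into why this is happening.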
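The difference between the two representations can be sketched in Python as follows. The helper names `to_parallel_arrays` and `to_nested_records` are hypothetical and exist only for this illustration; they are not part of BigQuery or the Datastore export, and the sketch makes no claim about how the import actually converts the data.

```python
from typing import Any, Dict, List


def to_parallel_arrays(children: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Split a list of child records into one array per field."""
    fields = ["name", "age", "gender"]
    return {field: [child[field] for child in children] for field in fields}


def to_nested_records(arrays: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Rebuild child records by pairing up values that share a position."""
    fields = list(arrays)
    return [dict(zip(fields, values)) for values in zip(*arrays.values())]


# Nested form: each child keeps its own name, age, and gender together.
children = [
    {"name": "name1", "age": 5, "gender": "M"},
    {"name": "name2", "age": 50, "gender": "F"},
    {"name": "name3", "age": 33, "gender": "M"},
]

arrays = to_parallel_arrays(children)
print(arrays)
print(to_nested_records(arrays) == children)
```

With nested records, the fields of one child stay together, so no position bookkeeping is needed; with parallel arrays, the link between a name and its age exists only through the shared index, which is why the queries above have to match on position.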