BigQuery: How to flatten a repeated structured property imported from Datastore

Asked: 2013-06-21 05:29:45

Tags: google-app-engine google-cloud-datastore google-bigquery

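Dear all,

This month I started using BigQuery to analyze data from the GAE Datastore. First, I export the data to Google Cloud Storage from the "Datastore Admin" page of the GAE console. Then I import the data from Google Cloud Storage into BigQuery. This works very smoothly except for repeated structured properties. I expected an imported record to have this format: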

    parent:"James",
    children: [{
        name: "name1",
        age: 5,
        gender: "M"
      }, {
        name: "name2",
        age: 50,
        gender: "F"
      }, {
        name: "name3",
        age: 33,
        gender: "M"
      },
    ]

I know how to flatten data in the format above. However, the data in BigQuery actually seems to be stored in the following format:

    parent: "James",
    children.name:["name1", "name2", "name3"],
    children.age:[5, 50, 33],
    children.gender:["M", "F", "M"],    

I would like to know whether the data above can be flattened in BigQuery for further analysis. The ideal result table would be:

    parentName, children.name, children.age, children.gender
    James, name1, 5, "M"
    James, name2, 50, "F"
    James, name3, 33, "M"      
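
For illustration only, the reshaping asked for here can be sketched in plain Python. This is not BigQuery code: the function name `flatten_parallel_arrays` and the dictionary layout are assumptions made up for this sketch, and it simply pairs up the three parallel arrays by position.

```python
def flatten_parallel_arrays(record):
    """Turn one parent record with parallel child arrays into one row per child."""
    names = record["children.name"]
    ages = record["children.age"]
    genders = record["children.gender"]
    # The arrays only describe children correctly if they line up by position.
    if not (len(names) == len(ages) == len(genders)):
        raise ValueError("parallel arrays differ in length")
    return [
        (record["parent"], name, age, gender)
        for name, age, gender in zip(names, ages, genders)
    ]


# Hypothetical record mirroring the layout shown in the question.
record = {
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}

rows = flatten_parallel_arrays(record)
for row in rows:
    print(row)
```

Each printed tuple corresponds to one line of the ideal result table.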

Cheers!

2 answers:

Answer 0 (score: 3)

BigQuery Standard SQL was introduced recently, and it makes things much nicer! Try the following (make sure to uncheck the "Use Legacy SQL" checkbox under Show Options):

WITH parents AS (
  SELECT 
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children    
)
SELECT 
  parentName, childrenName, childrenAge, childrenGender
FROM 
  parents, 
  UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
  UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age, 
  UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
  pos_name = pos_age AND pos_name = pos_gender

Here the original table, parents, contains the following data, shown as JSON:

[{
    "parentName": "James",
    "children": {
      "name": ["name1", "name2", "name3"],
      "age": ["5", "50", "33" ],
      "gender": ["M", "F", "M"]
    }
}]

The output is one row per child: (James, name1, 5, M), (James, name2, 50, F), (James, name3, 33, M).

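Note: the above is based purely on what I see in the original question and will most likely need to be adjusted to your specific needs. I hope it gives you a direction and a place to start!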

  

Added:

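The query above uses row-wise CROSS JOINs, which means that all combinations for the same parent are assembled first, and only then does the WHERE clause filter out the "wrong" ones.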
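
To make the size of that intermediate result concrete, here is a small Python sketch of the logical semantics only. It does not show how BigQuery actually executes the query, and the variable names are made up for illustration.

```python
from itertools import product

names = ["name1", "name2", "name3"]
ages = [5, 50, 33]
genders = ["M", "F", "M"]

# UNNEST ... WITH OFFSET pairs every element with its position in the array.
name_items = list(enumerate(names))
age_items = list(enumerate(ages))
gender_items = list(enumerate(genders))

# The comma joins logically build every combination first: 3 * 3 * 3 rows.
candidates = list(product(name_items, age_items, gender_items))

# The WHERE clause then keeps only combinations whose offsets agree.
rows = [
    ("James", name, age, gender)
    for (pos_name, name), (pos_age, age), (pos_gender, gender) in candidates
    if pos_name == pos_age and pos_name == pos_gender
]

print(len(candidates), len(rows))  # 27 3
```

For a parent with n children the logical cross product has n^3 rows, of which only n survive the filter.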

The version below instead uses INNER JOINs to eliminate this "side effect":

WITH parents AS (
  SELECT 
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children   
)
SELECT 
  parentName, childrenName, childrenAge, childrenGender
FROM 
  parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age 
  ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender 
  ON pos_age = pos_gender

Intuitively, I would expect the second version to be more efficient for larger tables.

Answer 1 (score: 1)

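You should be able to use the "large query results" feature to generate a new flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you flatten each field and save its position, then keep only the rows where the positions match. Try something like: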

SELECT parentName, children.name, children.age, children.gender, 
  position(children.name) as name_pos,
  position(children.age) as age_pos,
  position(children.gender) as gender_pos, 
    FROM table
SELECT
  parent,
  children.name,
  children.age,
  children.gender,
  pos
FROM (
  SELECT
    parent,
    children.name,
    children.age,
    children.gender,
    gender_pos,
    pos
  FROM (
      FLATTEN((
        SELECT
          parent,
          children.name,
          children.age,
          children.gender,
          pos,
          POSITION(children.gender) as gender_pos
        FROM (
          SELECT
            parent,
            children.name,
            children.age,
            children.gender,
            pos,              
          FROM (
              FLATTEN((
                SELECT
                  parent,
                  children.name,
                  children.age,
                  children.gender,
                  pos,
                  POSITION(children.age) AS age_pos
                FROM (
                    FLATTEN((
                      SELECT
                        parent,     
                        children.name,
                        children.age,
                        children.gender,
                        POSITION(children.name) AS pos
                      FROM table
                        ),
                      children.name))),
                children.age))
          WHERE
            age_pos = pos)),
        children.gender)))
WHERE
  gender_pos = pos;
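
The flatten-then-filter principle behind that nested query can be sketched step by step in plain Python. This is only a model of the idea, not BigQuery code: the helper `flatten_with_position` is hypothetical, and positions are numbered from 1 here, although the comparison only needs the numbering to be consistent across fields.

```python
def flatten_with_position(rows, field, pos_name):
    """Emit one row per element of row[field], recording the element's position."""
    out = []
    for row in rows:
        for pos, value in enumerate(row[field], start=1):
            flat = dict(row)       # copy so the other repeated fields stay intact
            flat[field] = value    # replace the array with a single element
            flat[pos_name] = pos   # remember where the element came from
            out.append(flat)
    return out


# Hypothetical table with one parent, mirroring the question's data.
table = [{
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}]

# Innermost query: flatten names and keep their position as pos.
step = flatten_with_position(table, "children.name", "pos")
# Next level: flatten ages, then keep rows where age_pos = pos.
step = flatten_with_position(step, "children.age", "age_pos")
step = [r for r in step if r["age_pos"] == r["pos"]]
# Outer level: flatten genders, then keep rows where gender_pos = pos.
step = flatten_with_position(step, "children.gender", "gender_pos")
step = [r for r in step if r["gender_pos"] == r["pos"]]

result = [
    (r["parent"], r["children.name"], r["children.age"], r["children.gender"])
    for r in step
]
for row in result:
    print(row)
```

Filtering after each flatten keeps the intermediate row count at 9 per parent in this example instead of letting it grow to 27.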

To allow large results in the BigQuery UI, click the "Advanced Options" button, specify a destination table, and check the "Allow Large Results" flag.

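Note that if your data is stored as an entity with nested records that look like {name, age, gender}, we should be converting it to nested records in BigQuery rather than parallel arrays. I'll look into why this is happening.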