BigQuery: how to insert a new value into a repeated record?

Time: 2018-11-05 11:00:53

Tags: google-bigquery

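I want to keep a history of user statuses.

For that, I have a table with two columns: user_id (the user identifier) and status.

user_id is a string, and status is a repeated record of key:value pairs: a date and a status value.

When a user changes status (for example, from active to inactive), I want to update this table and add the new status while keeping the old ones.

Here is the table schema: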

[
{
"description": "user identifier",
"mode": "REQUIRED",
"name": "user_id",
"type": "STRING"
},
{
"description": "status - can be either sent or pending, initial state is pending",
"mode": "REPEATED",
"name": "status",
"type": "RECORD",
"fields": [
  {
  "name": "status_date",
  "type": "DATE",
  "mode": "REQUIRED"
  },
  {
  "name": "value",
  "type": "STRING",
  "mode": "REQUIRED"
  }
]
}
]

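Is it even possible to insert a new user status with this schema? Should I redesign the schema? How do I properly take advantage of the nested feature in BigQuery?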

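1 answer:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL. It assumes you have the status table project.dataset.statuses as described in the question, plus an updates table project.dataset.updates in which you accumulate the updates to be applied periodically to the status table.

So dummy data might look like this: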

WITH `project.dataset.statuses` AS (
  SELECT 'a' user_id, [STRUCT<status_date DATE, value STRING>('2018-11-03', 'pending')] status UNION ALL
  SELECT 'b', [STRUCT<status_date DATE, value STRING>('2018-11-04', 'pending')] UNION ALL
  SELECT 'c', [] 
), `project.dataset.updates` AS (
  SELECT 'a' user_id, [STRUCT<status_date DATE, value STRING>('2018-11-05', 'sent')] new_statuses UNION ALL
  SELECT 'c', [STRUCT<status_date DATE, value STRING>('2018-11-05', 'pending')]
)

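Here the updates table has the same structure as the main table (its repeated column is named new_statuses) and holds the new statuses that need to be added to the main table.

The SELECT below returns the concatenated statuses: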

#standardSQL
SELECT 
  t.user_id, 
  IF(u.user_id IS NULL, status, ARRAY_CONCAT(status, new_statuses)) status
FROM `project.dataset.statuses` t
LEFT JOIN `project.dataset.updates` u
ON t.user_id = u.user_id   

You can use the DDL below to "update" the status table with them:

#standardSQL
CREATE OR REPLACE TABLE `project.dataset.statuses` AS
SELECT 
  t.user_id, 
  IF(u.user_id IS NULL, status, ARRAY_CONCAT(status, new_statuses)) status
FROM `project.dataset.statuses` t
LEFT JOIN `project.dataset.updates` u
ON t.user_id = u.user_id   

Applied to the dummy data above:

Statuses:

Row user_id status.status_date  status.value     
1   a       2018-11-03          pending  
2   b       2018-11-04          pending  
3   c             

Updates:

Row user_id new_statuses.status_date    new_statuses.value   
1   a       2018-11-05          sent     
2   c       2018-11-05          pending  

The result will be:

Row user_id status.status_date  status.value     
1   a       2018-11-03          pending  
            2018-11-05          sent     
2   b       2018-11-04          pending  
3   c       2018-11-05          pending    
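
When only one user changes status, rebuilding the whole table may be heavier than necessary. As a rough sketch (not part of the original answer), a DML UPDATE could append a single element to the repeated field in place. The user_id 'b', the date, and the value below are made-up illustration values.

```sql
#standardSQL
-- Sketch: append one new status to a single user's history.
-- 'b', the date and 'sent' are hypothetical example values.
UPDATE `project.dataset.statuses`
SET status = ARRAY_CONCAT(
  status,
  [STRUCT<status_date DATE, value STRING>(DATE '2018-11-06', 'sent')]
)
WHERE user_id = 'b'
```

The existing elements of status stay untouched; only the new struct is added at the end of the array.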

If the updates table can contain new users that are not yet in the main table, the following handles that case:

#standardSQL
-- CREATE OR REPLACE TABLE `project.dataset.statuses` AS
SELECT 
  IFNULL(t.user_id, u.user_id) user_id,
  CASE 
    WHEN t.user_id = u.user_id THEN ARRAY_CONCAT(status, new_statuses)
    WHEN t.user_id IS NULL THEN new_statuses
    WHEN u.user_id IS NULL THEN status
  END status
FROM `project.dataset.statuses` t
FULL JOIN `project.dataset.updates` u
ON t.user_id = u.user_id
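
The same upsert logic can probably also be written as a single DML MERGE statement, which modifies the table in place instead of recreating it. This is a sketch rather than part of the original answer. It assumes the updates table holds at most one row per user_id, since MERGE fails when more than one source row matches the same target row.

```sql
#standardSQL
-- Sketch: apply all accumulated updates in place.
-- Assumes at most one row per user_id in the updates table.
MERGE `project.dataset.statuses` t
USING `project.dataset.updates` u
ON t.user_id = u.user_id
WHEN MATCHED THEN
  -- existing user: append the new statuses to the history
  UPDATE SET status = ARRAY_CONCAT(t.status, u.new_statuses)
WHEN NOT MATCHED THEN
  -- new user: start the history with the new statuses
  INSERT (user_id, status) VALUES (u.user_id, u.new_statuses)
```

Users that appear only in the status table match neither clause and are left unchanged, which corresponds to the third CASE branch of the FULL JOIN query.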