Question

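I am trying to change the type of a column inside a repeated record from STRING to TIMESTAMP. The BQ docs offer a few suggestions (manually-changing-schemas), but I ran into a problem with each of the recommended approaches.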

Here is a sample schema:

{
  'name' => 'id',
  'type' => 'STRING',
  'mode' => 'REQUIRED'
},
{
  'name' => 'name',
  'type' => 'STRING',
  'mode' => 'REQUIRED'
},
// many more fields including nested records and repeated records
{
  'name' => 'locations',
  'type' => 'RECORD',
  'mode' => 'REPEATED',
  'fields' => [
    {
      'name' => 'city',
      'type' => 'STRING',
      'mode' => 'REQUIRED'
    },
    {
      'name' => 'updated_at',
      'type' => 'STRING',   // ** want this as TIMESTAMP **
      'mode' => 'REQUIRED'
    },
  ]
}

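Problem with using a query:

I think we would have to unnest the repeated records, cast the field to a timestamp for each repeated record, and then somehow rebuild the rows to insert into the new table.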

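Problem with exporting the table as JSON:

When the table is exported in JSON format, the export contains the raw JSON representation of the data (with maps and dictionaries, as we would expect).

However, we cannot import that raw data back into BQ:

BigQuery does not support maps or dictionaries in JSON. For example, "product_categories": {"my_product": 40.0} is not valid, but "product_categories": {"column1": "my_product", "column2": 40.0} is valid.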

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#limitations

Any suggestions would be greatly appreciated!

Answer 1

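The answer below relies on the fact that in BigQuery Standard SQL a REPEATED RECORD is represented as the type ARRAY<STRUCT<f1 f1_type, f2 f2_type ... >>.

It is not my favorite approach, since you have to specify the full column list. Maybe there is a better way.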

#standardSQL
-- Build sample data, try to mimic what's in question.
CREATE OR REPLACE TABLE
  <your_dataset>.sample_table AS
SELECT name, 
       array<struct<city string, update_at string>>[("SFO", "2011-1-1"), ("SEA", "2022-2-2")] 
       as locations
FROM UNNEST(['Name1', "Name2", "Name3"]) as name;

Then the SQL below converts the update_at column to DATE and saves the result to a new table (or to the same table, if you prefer).

#standardSQL
CREATE OR REPLACE TABLE
  <your_dataset>.output_table AS
SELECT * REPLACE (
   ARRAY(SELECT AS STRUCT * REPLACE(CAST(update_at AS DATE) AS update_at)
         FROM UNNEST(locations)) 
   AS locations 
   )
FROM
  <your_dataset>.sample_table;

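Since the question asks for TIMESTAMP rather than DATE, the same pattern should carry over by changing the cast target. The following is only a sketch against the sample table above: output_table_ts is a hypothetical table name, and it assumes every update_at string is in a form that CAST accepts (strings with no time zone are read as UTC by default).

```sql
#standardSQL
-- Sketch: same REPLACE pattern as above, casting to TIMESTAMP instead of DATE.
-- output_table_ts is a hypothetical name; pick whatever destination you need.
CREATE OR REPLACE TABLE
  <your_dataset>.output_table_ts AS
SELECT * REPLACE (
   -- Rebuild the repeated record, replacing only the update_at field.
   ARRAY(SELECT AS STRUCT * REPLACE(CAST(update_at AS TIMESTAMP) AS update_at)
         FROM UNNEST(locations))
   AS locations
   )
FROM
  <your_dataset>.sample_table;
```

If some of the strings might not parse, SAFE_CAST returns NULL instead of failing the query, and PARSE_TIMESTAMP with an explicit format string may be a better fit when the values are not in the default timestamp format.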