Flatten a nested JSON string into separate columns in Google BigQuery

Time: 2019-03-19 15:37:08

Tags: python json pandas google-bigquery

I have a column in one of my BigQuery tables that looks like this.

{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}

Is there a way to get output like this in GBQ? (Basically, flatten the whole column into separate columns.)

name    last_delivered.push_id   last_delivered.time   session_id   source   properties.UserId

name1       push_id1                     time1         session_id1   SDK          u1

  

a = {"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}

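I tried to get the desired output in Python with pandas, using json_normalize(a), but every attempt raises an error (posted as a screenshot in the original question, not reproduced here).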

Does anyone know how to get the desired output? Am I missing something?

Any help would be greatly appreciated!

2 answers:

Answer 0 (score: 2)

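The example below is for BigQuery Standard SQL. It pulls each scalar out with JSON_EXTRACT_SCALAR and groups the nested keys into STRUCTs, so they show up as last_delivered.push_id, last_delivered.time, and properties.UserId.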

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}' col
)
SELECT 
  JSON_EXTRACT_SCALAR(col, '$.name') name,
  STRUCT(
    JSON_EXTRACT_SCALAR(col, '$.last_delivered.push_id') AS push_id,
    JSON_EXTRACT_SCALAR(col, '$.last_delivered.time') AS time
  ) last_delivered,
  JSON_EXTRACT_SCALAR(col, '$.session_id') session_id,
  JSON_EXTRACT_SCALAR(col, '$.source') source,
  STRUCT(
    JSON_EXTRACT_SCALAR(col, '$.properties.UserId') AS UserId
  ) properties
FROM `project.dataset.table`   

and produces the result as expected/requested:

Row name    last_delivered.push_id  last_delivered.time session_id  source  properties.UserId    
1   name1   push_id1                time1               session_id1 SDK     u1     
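
If the JSON has many keys, writing each JSON_EXTRACT_SCALAR line by hand gets tedious. As a rough sketch, the select list could be generated in Python from one sample document. The helper names `json_paths` and `build_select` are made up for this illustration, the column is assumed to be called `col` as in the query above, and the output uses flat aliases joined with underscores rather than the STRUCTs shown above. Arrays are not handled.

```python
import json


def json_paths(obj, prefix="$"):
    """Yield a JSONPath string for every scalar leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            yield from json_paths(value, path)
        else:
            yield path


def build_select(sample, column="col"):
    """Build a SELECT list with one JSON_EXTRACT_SCALAR call per leaf.

    `column` is an assumed column name; aliases are flattened with "_"
    because a plain column alias cannot contain a dot.
    """
    lines = []
    for path in json_paths(json.loads(sample)):
        alias = path[2:].replace(".", "_")  # "$.a.b" -> "a_b"
        lines.append(f"  JSON_EXTRACT_SCALAR({column}, '{path}') AS {alias}")
    return "SELECT\n" + ",\n".join(lines)


sample = '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'
print(build_select(sample))
```

The generated text still needs a FROM clause appended, and it only covers keys present in the sample document.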

Answer 1 (score: 2)

My guess as to why it doesn't work is that your JSON data is actually a string:

from pandas.io.json import json_normalize 

a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''  

df = json_normalize(a)

Output:

AttributeError: 'str' object has no attribute 'values'    
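
One way to confirm this guess before changing anything is to check what type the value really is. A minimal sketch using only the standard library; the helper name `looks_like_json_text` is hypothetical:

```python
import json


def looks_like_json_text(value):
    """Return True if value is a str that parses to a JSON object."""
    if not isinstance(value, str):
        return False
    try:
        return isinstance(json.loads(value), dict)
    except json.JSONDecodeError:
        return False


a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''

print(type(a).__name__)         # str
print(looks_like_json_text(a))  # True
```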

Versus a dict:

from pandas.io.json import json_normalize 

a = {"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}  

df = json_normalize(a)

Output:

 print(df.to_string())
  last_delivered.push_id last_delivered.time   name properties.UserId   session_id source
0               push_id1               time1  name1                u1  session_id1    SDK

If your data is a string, you can parse it with json.loads() before normalizing:

import json
from pandas.io.json import json_normalize

a = '''{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}'''  

data = json.loads(a)
df = json_normalize(data)
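
A BigQuery column usually holds one such JSON string per row rather than a single value. As a sketch of the same parse-then-flatten idea that does not depend on a particular pandas version, the dotted column names can also be built by hand with the standard library. `flatten` and `rows` are illustrative names, the second row is invented sample data, and nested lists are left as-is:

```python
import json


def flatten(obj, parent=""):
    """Flatten nested dicts into one dict with dotted keys, e.g. 'a.b'."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Each element stands for one row of the column; the second row is made up.
rows = [
    '{"name": "name1", "last_delivered": {"push_id": "push_id1", "time": "time1"}, "session_id": "session_id1", "source": "SDK", "properties": {"UserId": "u1"}}',
    '{"name": "name2", "last_delivered": {"push_id": "push_id2", "time": "time2"}, "session_id": "session_id2", "source": "SDK", "properties": {"UserId": "u2"}}',
]

# Parse strings first; values that are already dicts pass through unchanged.
records = [flatten(json.loads(r) if isinstance(r, str) else r) for r in rows]
print(records[0])
```

The resulting list of flat dicts can be passed to pandas.DataFrame(records) if a DataFrame is needed.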