I have data that can use different JSON keys. I want to store all of it in BigQuery and explore the available fields later.
My structure looks like this:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
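I would like to use the STRUCT type, but it seems that all fields need to be declared?
I then want to be able to query how often each key appears, and essentially run queries over all records that have the a key, as if it were in its own column.
Side note: this data comes from URL query strings. Would it perhaps be better to push the full URL and parse it with functions?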
Answer 0 (score: 1)
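There are two main approaches you can use to store semi-structured data:
Option 1: Store a JSON string
You can store the data field as a JSON string and then use the JSON_EXTRACT function to pull out the values it can find; it returns NULL for any value it cannot find.
Since you mentioned needing to do math on the fields, let's do a simple SUM of the values of a and b: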
# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
UNION ALL
SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
UNION ALL
SELECT 3333 AS id, "{\"a\":27}" as data
UNION ALL
SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)
# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
SELECT id,
CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue # Extract & cast as an INT
FROM records
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
This approach has some pros and cons:
Pros
Cons
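The question also asks how often each key appears. With the JSON-string layout, one way to approximate this is to count the rows where extraction returns a non-NULL value. The following is only a sketch: it assumes the key names (a, b, c) are known in advance, it reuses the sample rows from above, and it uses JSON_EXTRACT_SCALAR, which also returns NULL when the stored value is a JSON null or a non-scalar, so such values would not be counted.

```sql
# Sample table built with WITH, as above; not needed for a real table.
WITH records AS (
  SELECT 1111 AS id, '{"a":27, "b":62, "c": "string"}' AS data
  UNION ALL
  SELECT 2222, '{"a":27, "c": "string"}'
  UNION ALL
  SELECT 3333, '{"a":27}'
  UNION ALL
  SELECT 4444, '{"a":27, "b":62, "c": "string"}'
)
# Count the rows in which each (assumed, known) key yields a scalar value.
SELECT
  COUNTIF(JSON_EXTRACT_SCALAR(data, '$.a') IS NOT NULL) AS aCount,
  COUNTIF(JSON_EXTRACT_SCALAR(data, '$.b') IS NOT NULL) AS bCount,
  COUNTIF(JSON_EXTRACT_SCALAR(data, '$.c') IS NOT NULL) AS cCount
FROM records
# results
# Row | aCount | bCount | cCount
# 1   | 4      | 2      | 3
```

Because every key has to be spelled out in the query, this only helps once you already know which keys to look for.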
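Option 2: Repeated fields
BigQuery has support for repeated fields, which lets you take your structure and express it natively in SQL.
Using the same example, here is how we would do it: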
## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
SELECT id, key, value
FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
FROM records R
LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
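As you can see, doing something like this is still fairly complex. You also have to store items as strings and CAST them to other types when needed, since you cannot mix types in a repeated field.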
Pros
Cons
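With the repeated key/value layout, counting how often each key appears does not require knowing the keys beforehand. The sketch below reuses the sample table from the query above; the alias kv and the output column name occurrences are just illustrative names.

```sql
## Same sample table as above, built with WITH
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
  (1111, [("a","27"),("b","62"),("c","string")]),
  (2222, [("a","27"),("c","string")]),
  (3333, [("a","27")]),
  (4444, [("a","27"),("b","62"),("c","string")])
]))
## Flatten every key/value pair into its own row, then count rows per key
SELECT kv.key, COUNT(*) AS occurrences
FROM records, UNNEST(records.data) AS kv
GROUP BY kv.key
ORDER BY kv.key
# results
# Row | key | occurrences
# 1   | a   | 4
# 2   | b   | 2
# 3   | c   | 3
```

This counts key/value pairs, so it matches the number of records per key only if a key appears at most once per record, as it does in the sample data.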
Hope that helps, and good luck.