我需要帮助来嵌套一个具有多个数组并且在嵌套对象中重复同一字段的字段(数量在嵌套对象的内部和外部重复)。
数据集中有2个字段:order_id和cart,其中cart是具有多个列表的字典对象,其中包括列表“ items”中的列表,但是cart的数据类型为字符串。我希望输出是每个产品和类别的单独行。
使用部分工作的查询来采样数据。
#standardSQL
WITH t AS (
SELECT "order1234" as order_id, '{ "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [{ "quantity" : 1 , "product" : { "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "pricing" : { "listPrice" : { "value" : 899} , "salePrice" : { "value" : 725}}, "imagedata" : { "imageLink" : { "_id" : "img1" , "createdOn" : { "$date" : "2019-01-19T19:55:19.782Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var1" , "sku" : { "value" : "sku1" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}] , "Shipping" : true}} , { "quantity" : 2 , "product" : { "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "pricing" : { "listPrice" : { "value" : 199} , "salePrice" : { "value" : 150}}, "imagedata" : { "imageLink" : { "_id" : "img2" , "createdOn" : { "$date" : "2019-01-19T19:58:11.484Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var2" , "sku" : { "value" : "sku2" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}] , "Shipping" : true}} , { "quantity" : 3 , "product" : { "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "pricing" : { "listPrice" : { "value" : 499} , "salePrice" : { "value" : 325}}, "imagedata" : { "imageLink" : { "_id" : "img3" , "createdOn" : { "$date" : "2019-01-15T05:34:17.556Z"} , "revision" : 3} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var3" , "sku" : { "value" : "sku3" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}], "Shipping" : true }}]}' as cart
)
select order_id, quantity, product, JSON_EXTRACT_SCALAR(product,'$._id') as product_id, REPLACE(category_id, '"', '') category_id,
JSON_EXTRACT_SCALAR(product,'$.pricing.listPrice.value') as product_list_price,
JSON_EXTRACT_SCALAR(product,'$.pricing.salePrice.value') as product_sale_price
from t,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"categoryIds" : \[(.+?)]')) categoryIds WITH OFFSET pos1,
UNNEST(SPLIT(categoryIds)) category_id,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"product" : (.*?)\}')) product WITH OFFSET pos2,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"quantity" : (.+?)')) quantity WITH OFFSET pos3
where pos1= pos2 and pos1 = pos3
在上述查询中,数量字段不正确,并且product_list_price广告和product_sale_price未显示。请记住,在嵌套元素中重复进行数量。我认为我的正则表达式是错误的,因此我需要以某种方式在每个“项目”中选择第一个“数量”,并且就价格而言,我的产品正则表达式没有给我完整的产品词典,这就是为什么它们返回为null的原因。知道产品密钥内部可能有多个{}的情况下,要获得产品密钥的完整价值的正确正则表达式是什么?
预期结果
order_id quantity product_id category_id product_list_price product_sale_price
order1234 1 prod1 cat1 899 799
order1234 1 prod1 cat2 899 799
order1234 1 prod1 cat3 899 799
order1234 2 prod2 cat2 199 150
order1234 2 prod2 cat4 199 150
order1234 3 prod3 cat2 499 399
order1234 3 prod3 cat4 499 399
order1234 3 prod3 cat5 499 399
答案 0 :(得分:1)
要知道产品密钥内可能有多个{},要获得产品密钥的完整值的正确正则表达式是什么?
理想情况下,应该使用JSON_EXTRACT(而不是REGEXP_EXTRACT-这会使事情变得过于复杂)。但是很不幸,BigQuery的JSON_EXTRACT有一些限制,不允许处理JSON数组
要克服JsonPath的BigQuery“局限性”,可以使用custom function,如下例所示:
它使用的jsonpath-0.8.0.js可以从https://code.google.com/archive/p/jsonpath/downloads下载并上传到Google Cloud Storage-gs://your_bucket/jsonpath-0.8.0.js
#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://your_bucket/jsonpath-0.8.0.js"
);
SELECT order_id, quantity, product_id, category_id
FROM `project.dataset.table`,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3
您可以使用您提供的示例数据进行测试,并进行上述操作:
#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
SELECT "order1234" AS order_id, '''{ "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [
{ "quantity" : 1 , "product" : { "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "imagedata" : { "imageLink" : { "_id" : "img1" , "createdOn" : { "$date" : "2019-01-19T19:55:19.782Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var1" , "sku" : { "value" : "sku1" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}] , "Shipping" : true}} ,
{ "quantity" : 2 , "product" : { "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "imagedata" : { "imageLink" : { "_id" : "img2" , "createdOn" : { "$date" : "2019-01-19T19:58:11.484Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var2" , "sku" : { "value" : "sku2" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}] , "Shipping" : true}} ,
{ "quantity" : 3 , "product" : { "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "imagedata" : { "imageLink" : { "_id" : "img3" , "createdOn" : { "$date" : "2019-01-15T05:34:17.556Z"} , "revision" : 3} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var3" , "sku" : { "value" : "sku3" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" : null , "sourceId" : null , "sourceSku" : null , "sourceMethod" : null , "sourceRedirectUrl" : null , "sourceRedirectAppKey" : null }}], "Shipping" : true }}
]}''' AS cart
)
SELECT order_id, quantity, product_id, category_id
FROM t,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3
有结果
Row order_id quantity product_id category_id
1 order1234 1 prod1 cat1
2 order1234 1 prod1 cat2
3 order1234 1 prod1 cat3
4 order1234 2 prod2 cat2
5 order1234 2 prod2 cat4
6 order1234 3 prod3 cat2
7 order1234 3 prod3 cat4
8 order1234 3 prod3 cat5
请注意:示例数据中没有product_list_price
和product_sale_price
,因此上述结果中也不存在。但是现在查询非常干净和简单,因此希望您能够轻松添加这些