I have a huge table (> 100 million rows) in Google BigQuery with the following structure:
name | departments
abc | 1,2,5,6
xyz | 4,5
pqr | 3,4,6
I want to transform the data into the following format:
name | 1 | 2 | 3 | 4 | 5 | 6
abc | 1 | 1 | | | 1 | 1
xyz | | | | 1 | 1 |
pqr | | | 1 | 1 | | 1
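So far, using the CONCAT and REGEXP_REPLACE functions, I have been able to write a query that generates the statements needed to prepare a dataset in this format: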
SELECT ' insert into dataset.output ( name, ' +
CONCAT(
'_' , replace(departments,',',',_') )
+ ' ) values( \'' + name +'\','+ REGEXP_REPLACE(departments, "([^,\n]+)", "1") +')'
FROM (
select name, departments from dataset.input )
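This produces output containing 100 M INSERT statements, which can then be used to create the data in the desired structure.
However, I have the following questions:
Can we execute the output of this query (the 100 M INSERT statements) directly with BigQuery SQL, or does each INSERT have to be fired one by one?
I believe there is no way to pivot or transpose data in a column that holds multiple comma-separated values. Is that right?
Is there an optimal way to do this using BigQuery SQL, without writing custom Java code?
Thanks.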
Answer 0: (score: 2)
Below is an example for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
with the result
Row name d1 d2 d3 d4 d5 d6
1 abc 1 1 0 0 1 1
2 xyz 0 0 0 1 1 0
3 pqr 0 0 1 1 0 1
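So you need to run the above query with its destination set to whatever new table you have prepared.
Note that the above assumes you have only 6 departments and, most importantly, that there is no ambiguity in the numbers, e.g. 1 will not be confused with 10.
If you do have such cases, you need to change lines like
IF(departments LIKE '%2%', 1, 0) AS d2,
into
IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2 ...
Of course, you can also just use a single INSERT statement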
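To see why the delimiter-wrapped pattern matters, the two LIKE expressions can be mimicked outside BigQuery. The Python sketch below is only an illustration, not part of the original answer: the helper names `naive_match` and `delimited_match` and the sample row `10,12` are hypothetical, and Python's substring test stands in for `LIKE '%...%'`.

```python
def naive_match(departments: str, department: int) -> int:
    """Mimics IF(departments LIKE '%N%', 1, 0): a plain substring search."""
    return 1 if str(department) in departments else 0


def delimited_match(departments: str, department: int) -> int:
    """Mimics IF(CONCAT(',', departments, ',') LIKE '%,N,%', 1, 0)."""
    # Wrapping both sides in commas forces a match on a whole list item.
    return 1 if f",{department}," in f",{departments}," else 0


if __name__ == "__main__":
    row = "10,12"  # hypothetical row that contains neither department 1 nor 2
    print(naive_match(row, 1), delimited_match(row, 1))
    print(naive_match(row, 10), delimited_match(row, 10))
```

With the plain substring test, department 1 is reported for a row that only lists 10 and 12, while the comma-wrapped test only matches complete list items.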
INSERT `project.dataset.new_table` (name, d1, d2, d3, d4, d5, d6)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
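So the bottom line of all this is: instead of generating an INSERT statement for each row in the original table, you should generate one simple SELECT statement that performs the "pivot".
Update for the "extreme" case, to minimize the generated code
See the example: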
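One way to generate that single SELECT is with a small script. The following is a minimal Python sketch under stated assumptions: the function name `build_pivot_query` is hypothetical, departments are assumed to be numbered 1 through N, and the comma-wrapped LIKE pattern is used so that 1 does not collide with 10.

```python
def build_pivot_query(table: str, department_count: int) -> str:
    """Return one SELECT that emits a 0/1 column per department."""
    columns = [
        f"  IF(CONCAT(',', departments, ',') LIKE '%,{n},%', 1, 0) AS d{n}"
        for n in range(1, department_count + 1)
    ]
    return (
        "#standardSQL\n"
        "SELECT\n"
        "  name,\n"
        + ",\n".join(columns)
        + f"\nFROM `{table}`"
    )


if __name__ == "__main__":
    # Table name taken from the answer's example; adjust to the real table.
    print(build_pivot_query("project.dataset.table", 6))
```

The generated text is one statement whose length grows with the number of departments, not with the number of rows in the table.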
#standardSQL
CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (
IF(departments LIKE CONCAT('%',CAST(department AS STRING),'%'), 1, 0)
);
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
), temp AS (
SELECT name, departments AS d
FROM `project.dataset.table`
)
SELECT
name,
c(d,1)d1,
c(d,2)d2,
c(d,3)d3,
c(d,4)d4,
c(d,5)d5,
c(d,6)d6
FROM temp
As you can see, each of your 10000 lines will now look like c(d,N)dN, with the longest being c(d,10000)d10000, so you have a chance of fitting within the query size limit.
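The saving from the compact form can be estimated before submitting anything to BigQuery. The sketch below is an illustration only: the helper names `verbose_columns` and `compact_columns` are hypothetical, and it measures only the column list, not the rest of the query. The actual query length limit should be checked against BigQuery's current quota documentation.

```python
def verbose_columns(count: int) -> str:
    """One spelled-out IF(...) expression per department, as in the first query."""
    return ",\n".join(
        f"IF(departments LIKE '%{n}%', 1, 0) AS d{n}" for n in range(1, count + 1)
    )


def compact_columns(count: int) -> str:
    """One c(d,N)dN call per department, as in the temp-function query."""
    return ",\n".join(f"c(d,{n})d{n}" for n in range(1, count + 1))


if __name__ == "__main__":
    count = 10_000  # department count from the answer's "extreme" case
    for label, text in (
        ("verbose", verbose_columns(count)),
        ("compact", compact_columns(count)),
    ):
        # Report the size of each generated column list in bytes.
        print(f"{label}: {len(text.encode('utf-8'))} bytes")
```

Comparing the two byte counts shows how much room the short function name and alias leave for the rest of the statement.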