Google BigQuery - Executing dynamically generated queries from a SELECT statement

Date: 2018-07-17 23:27:14

Tags: google-cloud-platform google-bigquery

I have a huge table (> 100 million rows) in Google BigQuery with the following structure:

name | departments
abc  | 1,2,5,6
xyz  | 4,5
pqr  | 3,4,6

I want to convert the data to the following format:

name | 1 | 2 | 3 | 4 | 5 | 6
abc  | 1 | 1 |   |   | 1 | 1
xyz  |   |   |   | 1 | 1 |
pqr  |   |   | 1 | 1 |   | 1

So far, using the CONCAT and REGEXP_REPLACE functions, I have been able to generate the queries needed to prepare the dataset in this format:

SELECT ' insert into dataset.output ( name, ' +
  CONCAT('_', REPLACE(departments, ',', ',_')) +
  ' ) values(  \'' + name + '\',' + REGEXP_REPLACE(departments, "([^,\n]+)", "1") + ')'
FROM (
  SELECT name, departments FROM dataset.input
)

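This produces output containing 100 M INSERT queries, which could be used to create the data in the desired structure.

However, here are my questions: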

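  1. Can we execute the output of this query (the 100 M INSERT queries) directly using BigQuery SQL, or does each INSERT have to be triggered one by one?

  2. I believe there is no way to pivot or transpose data in a column that holds multiple comma-separated values. Is that right?

  3. Is there an optimal way to do this with BigQuery SQL, without writing custom Java code?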

Thanks.

1 Answer:

Answer 0: (score: 2)

Below is an example for BigQuery Standard SQL:

   
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'abc' name, '1,2,5,6' departments UNION ALL
  SELECT 'xyz', '4,5' UNION ALL
  SELECT 'pqr', '3,4,6' 
)
SELECT 
  name,
  IF(departments LIKE '%1%', 1, 0) AS d1,
  IF(departments LIKE '%2%', 1, 0) AS d2,
  IF(departments LIKE '%3%', 1, 0) AS d3,
  IF(departments LIKE '%4%', 1, 0) AS d4,
  IF(departments LIKE '%5%', 1, 0) AS d5,
  IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`   

with the result:

Row name    d1  d2  d3  d4  d5  d6   
1   abc     1   1   0   0   1   1    
2   xyz     0   0   0   1   1   0    
3   pqr     0   0   1   1   0   1    

So you need to run the query above with its destination set to whatever new table you have prepared.

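Note that the above assumes you have only 6 departments and, most importantly, that there is no ambiguity in the numbers, e.g. 1 does not collide with 10.
If you do have such cases, you need to transform lines like

  IF(departments LIKE '%2%', 1, 0) AS d2,

into

  IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2 ...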
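
To see why the delimiters matter, here is a minimal sketch comparing the two forms. The `klm` row with departments 10 and 12 is made up for illustration and is not part of the original data; the sketch also assumes the list contains no spaces around the commas.

```sql
#standardSQL
-- Hypothetical sample data: the 'klm' row is an assumption added to show the collision.
WITH `project.dataset.table` AS (
  SELECT 'abc' AS name, '1,2,5,6' AS departments UNION ALL
  SELECT 'klm', '10,12'
)
SELECT
  name,
  -- Substring match: also fires on the "1" inside "10" and "12".
  IF(departments LIKE '%1%', 1, 0) AS d1_substring,
  -- Delimited match: only fires on a whole list element equal to "1".
  IF(CONCAT(',', departments, ',') LIKE '%,1,%', 1, 0) AS d1_delimited
FROM `project.dataset.table`
```

For `klm` the substring form returns 1 while the delimited form returns 0; for `abc` both return 1. An equivalent check would be `'1' IN UNNEST(SPLIT(departments, ','))`.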

Of course, you can also just use one simple INSERT statement:

INSERT `project.dataset.new_table` (name, d1, d2, d3, d4, d5, d6)    
SELECT 
  name,
  IF(departments LIKE '%1%', 1, 0) AS d1,
  IF(departments LIKE '%2%', 1, 0) AS d2,
  IF(departments LIKE '%3%', 1, 0) AS d3,
  IF(departments LIKE '%4%', 1, 0) AS d4,
  IF(departments LIKE '%5%', 1, 0) AS d5,
  IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`    

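So the bottom line of all this: instead of generating an INSERT statement for every row of the original table, you should generate one simple SELECT statement that performs the "pivot".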
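
If the target table does not exist yet, the same SELECT can probably create it as well. The following is a sketch only, assuming BigQuery's `CREATE TABLE ... AS SELECT` DDL is acceptable in your setup; `project.dataset.new_table` is the same placeholder name as in the INSERT above, and the delimiter-safe comparison is used for every column.

```sql
#standardSQL
-- Sketch: create and populate the pivoted table in one statement.
-- `project.dataset.new_table` is a placeholder name.
CREATE TABLE `project.dataset.new_table` AS
SELECT
  name,
  IF(CONCAT(',', departments, ',') LIKE '%,1,%', 1, 0) AS d1,
  IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2,
  IF(CONCAT(',', departments, ',') LIKE '%,3,%', 1, 0) AS d3,
  IF(CONCAT(',', departments, ',') LIKE '%,4,%', 1, 0) AS d4,
  IF(CONCAT(',', departments, ',') LIKE '%,5,%', 1, 0) AS d5,
  IF(CONCAT(',', departments, ',') LIKE '%,6,%', 1, 0) AS d6
FROM `project.dataset.table`
```

Either way, the whole table is processed by a single statement rather than by 100 M separate INSERTs.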

  

Update for the "extreme" case, to minimize the generated code

See the example:

#standardSQL
CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (
  IF(departments LIKE CONCAT('%',CAST(department AS STRING),'%'), 1, 0)
);
WITH `project.dataset.table` AS (
  SELECT 'abc' name, '1,2,5,6' departments UNION ALL
  SELECT 'xyz', '4,5' UNION ALL
  SELECT 'pqr', '3,4,6' 
), temp AS (
  SELECT name, departments AS d
  FROM `project.dataset.table`
)
SELECT 
name,
c(d,1)d1,
c(d,2)d2,
c(d,3)d3,
c(d,4)d4,
c(d,5)d5,
c(d,6)d6
FROM temp     

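As you can see, each of your 10000 lines will now look like c(d,N)dN, with a maximum length as in c(d,10000)d10000, so you have a chance of fitting within the query size limit.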
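
If BigQuery scripting is available in your project (an assumption; it is not part of the answer above), the column list may not need to be generated outside BigQuery at all. The sketch below builds the list with STRING_AGG and runs the resulting query with EXECUTE IMMEDIATE. The variable name `cols` and the upper bound 6 are illustrative, and `project.dataset.table` has to be a real table here rather than a WITH clause.

```sql
-- Holds the generated column list: "IF(...) AS d1, IF(...) AS d2, ..."
DECLARE cols STRING;

-- One delimiter-safe expression per department number 1..6 (illustrative bound).
SET cols = (
  SELECT STRING_AGG(
    FORMAT("IF('%d' IN UNNEST(SPLIT(departments, ',')), 1, 0) AS d%d", n, n),
    ', ' ORDER BY n)
  FROM UNNEST(GENERATE_ARRAY(1, 6)) AS n
);

-- Run the generated pivot query.
EXECUTE IMMEDIATE FORMAT(
  "SELECT name, %s FROM `project.dataset.table`", cols);
```

The generated text is still subject to the query size limit, so this only moves the code generation into BigQuery; it does not remove the limit.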