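Convert a categorical column to a binary representation in SQL

Time: 2017-02-05 06:18:37

Tags: sql google-bigquery

Consider a table with a column holding an array of strings that represents categorical data. Is there a simple way to transform this schema so that there are "number of categories" boolean columns representing the binary encoding of that categorical column?

Example: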

id      type
-------------
1       [A, C]
2       [B, C]

Converted to:

id    is_A     is_B    is_C
1     1        0       1
2     0        1       1

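I know I can do this manually, i.e. using:

WITH flat AS (SELECT id, cat FROM t, UNNEST(type) AS cat),
mid AS (SELECT id, IF(cat = 'A', 1, 0) AS is_A, IF(cat = 'B', 1, 0) AS is_B, IF(cat = 'C', 1, 0) AS is_C FROM flat)
SELECT id, SUM(is_A) AS is_A, SUM(is_B) AS is_B, SUM(is_C) AS is_C FROM mid GROUP BY id

But I am looking for a solution that works when the number of categories is around 1-10K. By the way, I am using BigQuery SQL.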

1 Answer:

Answer 0: (score: 2)

  

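"looking for a solution that works when the number of categories is around 1-10K"

Below is for BigQuery Standard SQL.

Step 1 - generate the dynamic query (similar to the one used in your question, but now built dynamically from your table, yourTable)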

#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
  "WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
  "ids AS (SELECT DISTINCT id FROM yourTable), ",
  "pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
  "flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
  "combinations AS ( ",
  "  SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
  "  FROM pairs AS p LEFT JOIN flat AS f ",
  "  ON p.cat = f.cat AND p.id=f.id ",
  ") ",
  "SELECT id, ",
  STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
  " FROM combinations ",
  "GROUP BY id ",
  "ORDER BY id"
) as query
FROM categories  

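Step 2 - copy the result of the above query, paste it back into the Web UI and run it

I think you get the idea. You can implement it purely in SQL, or you can generate the final query in any client of your choice.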
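For the client-side route, here is a minimal Python sketch that assembles the same query text as Step 1. The function name `build_pivot_query` is hypothetical, and the sketch assumes the category names are already known, contain no quote characters, and are valid as column-name suffixes.

```python
def build_pivot_query(categories, table="yourTable"):
    """Build the final pivot query text for the given category names.

    Assumption: category names contain no quotes and are safe to use
    as part of a column name (is_<category>).
    """
    cats = sorted(set(categories))
    # One SUM(IF(...)) column per category, mirroring the STRING_AGG in Step 1.
    flag_columns = ", ".join(
        f"SUM(IF(col = '{cat}', flag, 0)) AS is_{cat}" for cat in cats
    )
    return (
        f"WITH categories AS (SELECT DISTINCT cat FROM {table}, UNNEST(type) AS cat), "
        f"ids AS (SELECT DISTINCT id FROM {table}), "
        "pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), "
        f"flat AS (SELECT id, cat FROM {table}, UNNEST(type) cat), "
        "combinations AS ("
        " SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag"
        " FROM pairs AS p LEFT JOIN flat AS f"
        " ON p.cat = f.cat AND p.id = f.id"
        ") "
        f"SELECT id, {flag_columns} "
        "FROM combinations GROUP BY id ORDER BY id"
    )


query = build_pivot_query(["A", "B", "C"])
print(query)
```

The printed text can then be submitted through whichever BigQuery client you use, instead of being pasted into the Web UI.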

  
    

"I had tried this approach of generating the query (but in Python); the problem is that the query easily hits the 256KB query size limit in BigQuery"

  

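First, let's see how "easily" you reach the 256KB limit. Assuming an average category length of 10 characters, this approach covers about 4750 categories. With an average of 20 characters the coverage is about 3480, and with 30 it is about 2750.

If you "compress" the SQL by removing spaces, AS keywords and so on, you can get to about 5400, 3800 and 2970 categories for 10, 20 and 30 characters respectively.

So I would say yes, agreed: in real cases the query will most likely hit the limit before 5K categories.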
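A rough way to reproduce this kind of estimate is to count the characters each category adds to the generated query. The sketch below is only an approximation: `max_categories` is a hypothetical helper, the 500-character `fixed_overhead` for the non-repeating part of the query is a guess, and one byte per character is assumed.

```python
LIMIT_BYTES = 256 * 1024  # the 256KB query size limit discussed above

# Text that Step 1 emits around every category name.
PREFIX = "SUM(IF(col = '"
MIDDLE = "', flag, 0)) as is_"
SEPARATOR = ","  # default delimiter of STRING_AGG


def max_categories(avg_len, fixed_overhead=500, limit=LIMIT_BYTES):
    """Estimate how many categories fit into one generated query.

    Each category name appears twice (in the literal and in the alias).
    fixed_overhead is an assumed size for the CTEs and the final clauses.
    """
    per_category = len(PREFIX) + len(MIDDLE) + len(SEPARATOR) + 2 * avg_len
    return (limit - fixed_overhead) // per_category


for avg_len in (10, 20, 30):
    print(avg_len, max_categories(avg_len))
```

The results should land in the same ballpark as the figures quoted above; the exact values depend on the overhead you assume and on how the limit is counted.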

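So, second, let's see whether this is really a big problem! As an example, assume you need 6K categories. Let's see how to split this into two batches (assuming the 3K scenario does work with the initial solution).
What we need to do is split the categories into two groups, based purely on category name.
So the first group will be BETWEEN 'cat1' AND 'cat3000' and the second group will be BETWEEN 'cat3001' AND 'cat6000'. These names are placeholders: BETWEEN compares strings lexicographically, so pick the actual boundaries from the sorted list of your category names.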

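So now run Step 1 and Step 2 for both groups, with the temp1 and temp2 tables as destinations.
In Step 1, add the following to the very bottom of the query, after FROM categories:

WHERE cat BETWEEN 'cat1' AND 'cat3000'

for the first batch, and

WHERE cat BETWEEN 'cat3001' AND 'cat6000'

for the second batch.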
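If the category names are available on the client, the batch boundaries can be derived from the sorted list rather than chosen by hand. This is a minimal sketch with a hypothetical helper, `batch_where_clauses`; it assumes plain ASCII names without quote characters, so that Python's string ordering matches the comparison BigQuery applies in BETWEEN.

```python
def batch_where_clauses(categories, batch_size):
    """Return one WHERE clause per batch of at most batch_size categories.

    Names are sorted as strings, so each batch is a contiguous
    lexicographic range and no category falls between two batches.
    """
    cats = sorted(set(categories))
    clauses = []
    for start in range(0, len(cats), batch_size):
        batch = cats[start:start + batch_size]
        clauses.append(f"WHERE cat BETWEEN '{batch[0]}' AND '{batch[-1]}'")
    return clauses


# Twelve illustrative names; note that 'cat10' sorts before 'cat2'.
names = [f"cat{i}" for i in range(1, 13)]
for clause in batch_where_clauses(names, batch_size=6):
    print(clause)
```

Each printed clause goes at the bottom of the Step 1 query for the corresponding batch.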

Now, proceed to Step 3

Step 3 - combine the partial results

#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
  SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id

You can test this last piece of logic by prepending the simple dummy data below to the Step 3 query
WITH temp1 AS (
  SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL   
  SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL   
  SELECT 3 AS id, 1 AS is_A, 0 AS is_B    
),
temp2 AS (
  SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL   
  SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
  SELECT 3 AS id, 0 AS is_C, 1 AS is_D    
)

The above can easily be extended to more than two batches
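One possible way to extend Step 3 is to chain one FULL JOIN per additional table. The sketch below only generates the query text and has not been run against BigQuery; `build_combine_query` is a hypothetical helper, and it assumes every partial table has an `id` column and contains every id (which holds when all batches are produced from the same source table).

```python
def build_combine_query(tables):
    """Generate the Step 3 query for two or more partial-result tables."""
    if len(tables) < 2:
        raise ValueError("need at least two partial-result tables")
    first, rest = tables[0], tables[1:]
    # id2, id3, ...: renamed id columns, dropped again by SELECT * EXCEPT.
    aliases = [f"id{n}" for n in range(2, len(tables) + 1)]
    joins = [
        f"FULL JOIN (SELECT id AS {alias}, * EXCEPT(id) FROM {table}) ON id = {alias}"
        for alias, table in zip(aliases, rest)
    ]
    lines = [f"SELECT * EXCEPT({', '.join(aliases)})", f"FROM {first}"] + joins
    return "\n".join(lines)


print(build_combine_query(["temp1", "temp2", "temp3"]))
```

With two tables this produces the same query as Step 3 above, minus the commented-out ORDER BY.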

Hope this helps