Reshaping from wide to long in BigQuery (standard SQL)

Time: 2017-12-05 10:02:31

Tags: sql google-bigquery

Unfortunately, reshaping in BQ is not as easy as it is in R, and I can't export the data for this project.

Here is the input:

date        country  A        B    C   D
20170928    CH       3000.3   121  13  3200
20170929    CH       2800.31  137  23  1614.31

Expected output:

date        country  Metric  Value
20170928    CH       A       3000.3
20170928    CH       B       121
20170928    CH       C       13
20170928    CH       D       3200
20170929    CH       A       2800.31
20170929    CH       B       137
20170929    CH       C       23
20170929    CH       D       1614.31

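My real table has many more columns and rows (but I assume this will take a lot of manual work).

3 answers:

Answer 0: (score: 4)

Below is for BigQuery Standard SQL. It does not require repeating a SELECT for each column, however many columns there are: it picks up every column in the table and turns it into a metric/value pair.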

#standardSQL
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

You can test and play with the above using dummy data like that in your question:

#standardSQL
WITH `project.dataset.yourtable` AS (
  SELECT '20170928' DATE, 'CH' country, 3000.3 A, 121 B, 13 C, 3200 D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
SELECT DATE, country,
  metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT DATE, country, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 
  FROM `project.dataset.yourtable` t, 
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(t), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('date', 'country')

The result is as expected:

DATE        country metric  value    
20170928    CH      A       3000.3   
20170928    CH      B       121.0    
20170928    CH      C       13.0     
20170928    CH      D       3200.0   
20170929    CH      A       2800.31  
20170929    CH      B       137.0    
20170929    CH      C       23.0     
20170929    CH      D       1614.31  
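
To see why this works, it can help to look at the intermediate values. The sketch below is not part of the original answer; it uses a one-row sample (the CTE name sample is a placeholder) and shows the JSON string that TO_JSON_STRING produces for the row, next to the name/value fragments obtained by stripping the braces and splitting on commas.

```sql
#standardSQL
-- One-row sample; the CTE name "sample" is a placeholder for illustration.
WITH sample AS (
  SELECT '20170928' AS date, 'CH' AS country, 3000.3 AS A, 121 AS B
)
SELECT
  TO_JSON_STRING(t) AS row_json,  -- the whole row serialized as one JSON object
  pair                            -- one "name":value fragment per column
FROM sample AS t,
  -- strip the outer braces, then split on the default ',' delimiter
  UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) AS pair
```

Because the row is taken apart with plain string splitting, a value that itself contains a comma or a colon (a free-text field or a timestamp, for example) would presumably be split in the wrong place, so the approach seems best suited to tables whose remaining columns are simple numbers or short codes.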

Answer 1: (score: 2)

You need a UNION, which BigQuery legacy SQL denotes with a comma between subqueries:

SELECT date, country, Metric, Value
FROM (
  SELECT date, country, 'A' as Metric,  A as Value FROM your_table
), (
  SELECT date, country, 'B' as Metric,  B as Value FROM your_table
), (
  SELECT date, country, 'C' as Metric,  C as Value FROM your_table
) , (
  SELECT date, country, 'D' as Metric,  D as Value FROM your_table
)
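
The comma form above is legacy SQL; in standard SQL a comma between subqueries is a cross join, so the same idea has to be written with UNION ALL. The following is a sketch of that equivalent, not part of the original answer. It inlines the question's sample rows under the name your_table so it can be run as is, and casts each metric to FLOAT64 so that all branches share one type.

```sql
#standardSQL
-- Sample rows from the question, inlined so the query is self-contained.
WITH your_table AS (
  SELECT '20170928' AS date, 'CH' AS country,
         3000.3 AS A, 121 AS B, 13 AS C, 3200 AS D UNION ALL
  SELECT '20170929', 'CH', 2800.31, 137, 23, 1614.31
)
-- One SELECT per column to unpivot; column names come from the first branch.
SELECT date, country, 'A' AS Metric, CAST(A AS FLOAT64) AS Value FROM your_table
UNION ALL
SELECT date, country, 'B', CAST(B AS FLOAT64) FROM your_table
UNION ALL
SELECT date, country, 'C', CAST(C AS FLOAT64) FROM your_table
UNION ALL
SELECT date, country, 'D', CAST(D AS FLOAT64) FROM your_table
ORDER BY date, Metric
```

This still needs one branch per column, so it is practical only for a handful of metrics.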

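Answer 2: (score: 1)

Most of the answers I managed to find require specifying the name of every column to be melted. That is hard to manage when the table has hundreds or thousands of columns. Here is an answer that works for an arbitrarily wide table.

It uses dynamic SQL: it automatically extracts the column names from the table schema, assembles a command string, and then evaluates that string. It is intended to mimic the behavior of Python pandas.melt() / R reshape2::melt().

I deliberately did not create a user-defined function because of some undesirable properties of UDFs. Depending on how you use this, you may or may not want to do that.

Input: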

id0 id1 _2020_05_27 _2020_05_28
1   1   11          12
1   2   13          14
2   1   15          16
2   2   17          18

Output:

id0 id1 date         value
1   2   _2020_05_27  13
1   2   _2020_05_28  14
2   2   _2020_05_27  17
2   2   _2020_05_28  18
1   1   _2020_05_27  11
1   1   _2020_05_28  12
2   1   _2020_05_27  15
2   1   _2020_05_28  16

#standardSQL

-- PANDAS MELT FUNCTION IN GOOGLE BIGQUERY
-- author: Luna Huang
-- email: lunahuang@google.com

-- run this script with Google BigQuery Web UI in the Cloud Console

-- this piece of code functions like the pandas melt function
-- pandas.melt(id_vars, value_vars, var_name, value_name, col_level=None)
-- without utilizing user defined functions (UDFs)
-- see below for where to input corresponding arguments

DECLARE cmd STRING;
DECLARE subcmd STRING;
SET cmd = ("""
  WITH original AS (
    -- query to retrieve the original table
    %s
  ),
  nested AS (
    SELECT
    [
      -- sub command to be automatically generated
      %s
    ] as s,
    -- equivalent to id_vars in pandas.melt()
    %s,
    FROM original
  )
  SELECT
    -- equivalent to id_vars in pandas.melt()
    %s,
    -- equivalent to var_name in pandas.melt()
    s.key AS %s,
    -- equivalent to value_name in pandas.melt()
    s.value AS %s,
  FROM nested
  CROSS JOIN UNNEST(nested.s) AS s
""");
SET subcmd = ("""
  WITH
  columns AS (
    -- query to retrieve the column names
    -- equivalent to value_vars in pandas.melt()
    -- the resulting table should have only one column
    -- with the name: column_name
    %s
  ),
  scs AS (
    SELECT FORMAT("STRUCT('%%s' as key, %%s as value)", column_name, column_name) AS sc
    FROM columns
  )
  SELECT ARRAY_TO_STRING(ARRAY (SELECT sc FROM scs), ",\\n")
""");

-- -- -- EXAMPLE BELOW -- -- --

-- SET UP AN EXAMPLE TABLE --
CREATE OR REPLACE TABLE `tmp.example`
(
  id0 INT64,
  id1 INT64,
  _2020_05_27 INT64,
  _2020_05_28 INT64,
);
INSERT INTO `tmp.example` VALUES (1, 1, 11, 12);
INSERT INTO `tmp.example` VALUES (1, 2, 13, 14);
INSERT INTO `tmp.example` VALUES (2, 1, 15, 16);
INSERT INTO `tmp.example` VALUES (2, 2, 17, 18);

-- MELTING STARTS --
-- execute these two commands to melt the table

-- the first generates the STRUCT commands
-- and saves a string in subcmd
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  subcmd,
  -- query to retrieve the column names
  -- equivalent to value_vars in pandas.melt()
  -- the resulting table should have only one column
  -- with the name: column_name
  """
    SELECT column_name
    FROM `tmp.INFORMATION_SCHEMA.COLUMNS`
    WHERE (table_name = "example") AND (column_name NOT IN ("id0", "id1"))
  """
) INTO subcmd;

-- the second implements the melting
EXECUTE IMMEDIATE FORMAT(
  -- please do not change this argument
  cmd,
  -- query to retrieve the original table
  """
    SELECT *
    FROM `tmp.example`
  """,
  -- please do not change this argument
  subcmd,
  -- equivalent to id_vars in pandas.melt()
  -- !!please type these twice!!
  "id0, id1", "id0, id1",
  -- equivalent to var_name in pandas.melt()
  "date",
  -- equivalent to value_name in pandas.melt()
  "value"
);
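
For reference, here is roughly what the assembled command should expand to for the example table. This is a hand-written reconstruction from the two templates above, not output captured from a run, and it inlines two of the example rows instead of reading `tmp.example` so that it can be tried on its own.

```sql
#standardSQL
WITH original AS (
  -- two rows of the example table, inlined for illustration
  SELECT 1 AS id0, 1 AS id1, 11 AS _2020_05_27, 12 AS _2020_05_28 UNION ALL
  SELECT 1, 2, 13, 14
),
nested AS (
  SELECT
    [
      -- one STRUCT per melted column, as generated by the first EXECUTE IMMEDIATE
      STRUCT('_2020_05_27' AS key, _2020_05_27 AS value),
      STRUCT('_2020_05_28' AS key, _2020_05_28 AS value)
    ] AS s,
    id0, id1
  FROM original
)
SELECT
  id0, id1,
  s.key AS date,
  s.value AS value
FROM nested
CROSS JOIN UNNEST(nested.s) AS s
```

Each input row becomes an array with one STRUCT per melted column, and UNNEST then emits one output row per array element. Since all elements of an array must share a type, the melted columns presumably need to be of the same (or a coercible) type for the generated query to run.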