Vertica dynamic pivot/transform

Time: 2019-02-09 14:29:03

Tags: sql vertica

I have a table in Vertica:

id   Timestamp    Mask1       Mask2
------------------------------------------- 
 1    11:30         50         100
 1    11:35         52         101 
 2    12:00         53         102
 3    09:00         50         100
 3    22:10         52         105
 .     .            .           .
 .     .            .           .

I want to transform it into:

 id    rows     09:00    11:30    11:35     12:00     22:10     ....... 
-------------------------------------------------------------- 
 1     Mask1     Null     50       52       Null       Null     ....... 
       Mask2     Null     100      101      Null       Null     ....... 
 2     Mask1     Null     Null     Null     53         Null     .......     
       Mask2     Null     Null     Null     102        Null     .......
 3     Mask1     50       Null     Null     Null       52       .......     
       Mask2     100      Null     Null     Null       105      .......

The dots (...) mean that I have many more records.

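  1. The timestamp covers a whole day in hours:minutes:seconds format, from 00:00:00 to 24:00:00 (I only used hours:minutes to illustrate the problem).
  2. I have shown only two extra columns, Mask1 and Mask2. In reality I have about 200 Mask columns to work with.
  3. I have shown 5 records, but in reality I have about 1 million.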

What I have tried so far:

  1. Dumping the records for each id into a CSV file.
  2. Applying a transpose in Python pandas.
  3. Joining the transposed tables.

A possible generic solution might be to apply a pivot in Vertica (or a UDTF), but I am fairly new to this database.

I have been struggling with this logic for a couple of days. Can anyone help me? Thanks a lot.

2 answers:

Answer 0: (score: 1)

You can unpivot the data using union all and then use conditional aggregation:

select id, which,
       max(case when timestamp >= '09:00' and timestamp < '09:30'  then mask end) as "09:00",
       max(case when timestamp >= '09:30' and timestamp < '10:00' then mask end) as "09:30",
       max(case when timestamp >= '10:00' and timestamp < '10:30' then mask end) as "10:00",
       . . .
from ((select id, timestamp,
              'Mask1' as which, Mask1 as mask
       from t
      ) union all
      (select id, timestamp, 'Mask2' as which, Mask2 as mask
       from t
      ) 
     ) t
group by t.id, t.which;

Note: this includes the id on every row. I strongly recommend doing that, but you could use:

select (case when which = 'Mask1' then id end) as id

if you really want to.
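With about 200 Mask columns, writing every union all branch by hand gets tedious. The following is a minimal Python sketch that generates the unpivot subquery as text. It assumes the columns are named Mask1 through Mask200 and that the table is called t as in the query above; build_unpivot_sql is a hypothetical helper, not part of Vertica or any library. Identifiers are interpolated without quoting, so only trusted names should be passed in.

```python
def build_unpivot_sql(table, mask_columns):
    """Return a UNION ALL subquery with one branch per mask column."""
    branches = [
        f"select id, timestamp, '{col}' as which, {col} as mask from {table}"
        for col in mask_columns
    ]
    return "\nunion all\n".join(branches)


# Assumed naming scheme: Mask1 .. Mask200 (the question only shows Mask1 and Mask2).
mask_columns = [f"Mask{n}" for n in range(1, 201)]
sql = build_unpivot_sql("t", mask_columns)
print(sql.splitlines()[0])
```

The generated text can be pasted in place of the two hand-written branches inside the from clause.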

Answer 1: (score: 1)

Below is a solution. I coded it only for the time values that appear in your sample data.

However, if you really want to be able to show all 86400 values from '00:00:00' to '23:59:59', you won't be able to. Vertica's maximum number of columns is 1600.

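You could, however, use the Vertica function TIME_SLICE(timestamp::TIMESTAMP,1,'MINUTE')::TIME (TIME_SLICE takes a timestamp as input and returns a timestamp, so you have to cast (::) back and forth) to reduce the number of distinct time values, and hence output columns, to at most 1440.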
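The arithmetic behind that limit can be checked with a small Python sketch. It only mimics the effect of truncating a time of day to the minute; it does not call Vertica, and floor_to_minute is a hypothetical helper introduced for illustration.

```python
from datetime import time


def floor_to_minute(t):
    """Drop seconds and microseconds, like a 1-minute slice of a time of day."""
    return t.replace(second=0, microsecond=0)


SECONDS_PER_DAY = 24 * 60 * 60  # 86400 possible hh:mm:ss values
MINUTES_PER_DAY = 24 * 60       # 1440 possible hh:mm values
MAX_COLUMNS = 1600              # column limit quoted in the answer

# Made-up sample times; two of them fall into the same minute.
samples = [time(11, 30, 12), time(11, 30, 59), time(11, 35, 0)]
sliced = sorted({floor_to_minute(t) for t in samples})
print(sliced)
print(SECONDS_PER_DAY > MAX_COLUMNS, MINUTES_PER_DAY + 2 <= MAX_COLUMNS)
```

One column per second cannot fit, while one column per minute plus the id and rows columns stays under the limit.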

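In any case, I would start with SELECT DISTINCT timestamp FROM input ORDER BY 1; and then, for each timestamp found (hopefully no more than 1598 of them), generate one line for the final query, like the ones below that are actually used for your data: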

, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"

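SQL generally does not allow a variable number of output columns for a given query. If the number of final columns varies with the data, you have to generate the final query from the data and then run it.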
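As a rough illustration of "generate the final query from the data", here is a Python sketch that turns a list of distinct time values into the SUM(CASE ...) lines shown above. It assumes the values have already been fetched by the SELECT DISTINCT step and come from a trusted source (they are interpolated without escaping). build_pivot_sql is a hypothetical helper, and the default source name vertical refers to the CTE in the full script in this answer.

```python
def build_pivot_sql(time_values, source="vertical"):
    """Build a conditional-aggregation query with one column per time value."""
    if len(time_values) > 1598:  # 1600 columns minus id and rows, per the answer
        raise ValueError("too many distinct time values for a single query")
    columns = [
        f", SUM(CASE timestamp WHEN '{tv}' THEN val END) AS \"{tv}\""
        for tv in time_values
    ]
    lines = ["SELECT", "  id", ", rows", *columns,
             f"FROM {source}", "GROUP BY id, rows", "ORDER BY id, rows;"]
    return "\n".join(lines)


# Values as they would come back from the SELECT DISTINCT step for the sample data.
time_values = ["09:00", "11:30", "11:35", "12:00", "22:10"]
query = build_pivot_sql(time_values)
print(query)
```

The resulting text still has to be sent to the database by whatever client you use; that part is left out here.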

Welcome to SQL and relational databases.

Here is the complete script for your data. I first pivot vertically along the "Mask-n" column names, then horizontally along the timestamps.

\pset null Null
-- ^ this is a vsql command to display nulls with the "Null" string
WITH 
-- your input, not in final query
input(id,Timestamp,Mask1,Mask2) AS (
          SELECT 1 ,  TIME '11:30'    ,    50    ,    100
UNION ALL SELECT 1 ,  TIME '11:35'    ,    52    ,    101
UNION ALL SELECT 2 ,  TIME '12:00'    ,    53    ,    102
UNION ALL SELECT 3 ,  TIME '09:00'    ,    50    ,    100
UNION ALL SELECT 3 ,  TIME '22:10'    ,    52    ,    105
)
,
-- real WITH clause starts here
-- need an index for your 200 masks
i(i) AS (
  SELECT MICROSECOND(ts) FROM (
            SELECT TIMESTAMPADD(MICROSECOND,  1,TIMESTAMP '2000-01-01') AS tm
  UNION ALL SELECT TIMESTAMPADD(MICROSECOND,200,TIMESTAMP '2000-01-01') AS tm
  )x
  TIMESERIES ts AS '1 MICROSECOND' OVER(ORDER BY tm)
)
,
-- verticalised masks
vertical AS (
  SELECT
    id
  , i
  , CASE i 
      WHEN   1 THEN 'Mask001' 
      WHEN   2 THEN 'Mask002' 
      WHEN 200 THEN 'Mask200' 
    END AS rows
  , timestamp
  , CASE i
      WHEN   1 THEN Mask1 
      WHEN   2 THEN Mask2 
      WHEN 200 THEN 0 -- no mask200 present
    END AS val
  FROM input CROSS JOIN i
  WHERE i <=2 -- only 2 masks present currently
)
-- test the vertical CTE ...
-- SELECT * FROM vertical order by id,rows,timestamp;
-- out  id | i |  rows   | timestamp | val 
-- out ----+---+---------+-----------+-----
-- out   1 | 1 | Mask001 | 11:30:00  |  50
-- out   1 | 1 | Mask001 | 11:35:00  |  52
-- out   1 | 2 | Mask002 | 11:30:00  | 100
-- out   1 | 2 | Mask002 | 11:35:00  | 101
-- out   2 | 1 | Mask001 | 12:00:00  |  53
-- out   2 | 2 | Mask002 | 12:00:00  | 102
-- out   3 | 1 | Mask001 | 09:00:00  |  50
-- out   3 | 1 | Mask001 | 22:10:00  |  52
-- out   3 | 2 | Mask002 | 09:00:00  | 100
-- out   3 | 2 | Mask002 | 22:10:00  | 105
SELECT
  id
, rows
, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"
FROM vertical
GROUP BY
  id
, rows
ORDER BY
  id
, rows
;
-- out Null display is "Null".
-- out  id |  rows   | 09:00 | 11:30 | 11:35 | 12:00 | 22:10 
-- out ----+---------+-------+-------+-------+-------+-------
-- out   1 | Mask001 |  Null |    50 |    52 |  Null |  Null
-- out   1 | Mask002 |  Null |   100 |   101 |  Null |  Null
-- out   2 | Mask001 |  Null |  Null |  Null |    53 |  Null
-- out   2 | Mask002 |  Null |  Null |  Null |   102 |  Null
-- out   3 | Mask001 |    50 |  Null |  Null |  Null |    52
-- out   3 | Mask002 |   100 |  Null |  Null |  Null |   105
-- out (6 rows)
-- out 
-- out Time: First fetch (6 rows): 28.143 ms. All rows formatted: 28.205 ms
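For readers who want to trace the two pivot steps outside the database, here is a small pure-Python sketch on the same five sample rows. It is an illustration of the logic only, not a replacement for the SQL. Plain assignment stands in for SUM, which is equivalent here only because each (id, mask, timestamp) combination occurs once in the sample.

```python
from collections import defaultdict

# (id, timestamp, Mask1, Mask2): the sample rows from the question.
rows = [
    (1, "11:30", 50, 100),
    (1, "11:35", 52, 101),
    (2, "12:00", 53, 102),
    (3, "09:00", 50, 100),
    (3, "22:10", 52, 105),
]
mask_names = ["Mask001", "Mask002"]

# Step 1: vertical pivot, one (id, mask name, timestamp, value) tuple per mask.
vertical = [
    (rec[0], name, rec[1], rec[2 + k])
    for rec in rows
    for k, name in enumerate(mask_names)
]

# Step 2: horizontal pivot, one dict of timestamp -> value per (id, mask name).
pivot = defaultdict(dict)
for id_, name, ts, val in vertical:
    pivot[(id_, name)][ts] = val

time_values = sorted({ts for _, _, ts, _ in vertical})
for key in sorted(pivot):
    # Missing combinations come back as None, the counterpart of SQL Null.
    print(key, [pivot[key].get(ts) for ts in time_values])
```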