我有一个单独的级别表,按Person_ID和Date升序排列。在Person_ID级别有重复的条目。我想做的是在每个列中“向下填充”空值-我的印象是last_value(| ignore nulls)函数将对每个列都完美地工作。
一个主要问题是表格的宽度为数百列,并且非常动态(为ML实验创建功能)。必须有一个比为每个变量写一个last_value语句更好的方法,像这样:
SELECT last_value(var1) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var1,
last_value(var2) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var2,
...
last_value(var300) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var3
FROM TABLE
总结,我有下表:
+----------+-----------+------+------+---+------------+
| PersonID | YearMonth | Var1 | Var2 | … | Var300 |
+----------+-----------+------+------+---+------------+
| 1 | 200901 | 2 | null | | null |
| 1 | 200902 | null | 1 | | Category 1 |
| 1 | 201010 | null | 1 | | null |
+----------+-----------+------+------+---+------------+
并希望获得下表:
+----------+-----------+------+------+---+------------+
| PersonID | YearMonth | Var1 | Var2 | … | Var300 |
+----------+-----------+------+------+---+------------+
| 1 | 200901 | 2 | null | | null |
| 1 | 200902 | 2 | 1 | | Category 1 |
| 1 | 201010 | 2 | 1 | | Category 1 |
+----------+-----------+------+------+---+------------+
答案 0 :(得分:1)
我认为没有什么好的选择,但是您可能会考虑以下两种方法。
在这种方法中,您使用递归查询,其中每个子值等于自己,或者如果其为null,则返回其父级的值。像这样:
WITH
ordered AS (
SELECT yt.*
row_number() over ( partition by yt.personid order by yt.yearmonth ) rn
FROM YOUR_TABLE yt),
downfilled ( personid, yearmonth, var1, var2, ..., var300, rn) as (
SELECT o.*
FROM ordered o
WHERE o.rn = 1
UNION ALL
SELECT c.personid, c.yearmonth,
nvl(c.var1, p.var1) var1,
nvl(c.var2, p.var2) var2,
...
nvl(c.var300, p.var300) var300
FROM downfilled p INNER JOIN ordered c ON c.personid = p.personid AND c.rn = p.rn + 1 )
SELECT * FROM downfilled
ORDER BY personid, yearmonth;
这将替换每个表达式,如下所示:
last_value(var2) OVER (PARTITION BY Person_ID ORDER BY Date ASC
RANGE BETWEEN UNBOUNDED PRECEDING) as Var2
具有这样的表达式:
NVL(c.var2, p.var2)
但是,缺点是,这会使您重复两次300列的列表(一次用于300个NVL()
表达式,一次重复指定递归CTE(downfilled
)的输出列。
在这种方法中,您将UNPIVOT
列VARxx
排成行,因此只需要一次编写last_value()...
表达式。
SELECT personid,
yearmonth,
var_column,
last_value(var_value ignore nulls)
over ( partition by personid, var_column order by yearmonth ) var_value
FROM YOUR_TABLE
UNPIVOT INCLUDE NULLS ( var_value FOR var_column IN ("VAR1","VAR2","VAR3") ) )
SELECT * FROM unp
PIVOT ( max(var_value) FOR var_column IN ('VAR1' AS VAR1, 'VAR2' AS VAR, 'VAR3' AS VAR3 ) )
在这里,您仍然需要每列列出两次。另外,我不确定如果您拥有大量数据集,性能会如何。