Self-referential CASE WHEN clause in SQL

Date: 2016-04-04 18:51:38

Tags: sql postgresql self-reference

I'm trying to migrate some poorly formed data into a database. The data comes from a CSV, and is first loaded into a staging table of all varchar columns (as I cannot enforce type safety at this stage).

The data might look like

COL1     | COL2 | COL3
Name 1   |      |     
2/11/16  | $350 | $230
2/12/16  | $420 | $387
2/13/16  | $435 | $727
Name 2   |      |     
2/11/16  | $121 | $144
2/12/16  | $243 | $658
2/13/16  | $453 | $214

The first column is a mixture of company names acting as pseudo-headers and dates for which the data in columns 2 and 3 is relevant. I'd like to start transforming the data by creating a 'StoreBrand' column, where StoreBrand is the value of Col1 if Col2 is NULL, or the previous row's StoreBrand otherwise. Something like:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | Name 1
2/13/16  | $435 | $727 | Name 1
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | Name 2
2/13/16  | $453 | $214 | Name 2

I wrote this:

SELECT 
    t.*,
    CASE
        WHEN t.COL2 IS NULL THEN COL1
        ELSE                     LAG(StoreBrand) OVER ()
    END AS StoreBrand
FROM
(
    SELECT
        ROW_NUMBER() OVER () AS i,
        *
    FROM
        Staging_Data
) t;

But the database (Postgres in this case, though we're considering alternatives, so the most portable answer is preferred) chokes on LAG(StoreBrand) because that's the derived column I'm creating. Using LAG(Col1) instead only gets the first data row after each brand name right; every later row picks up the previous row's date:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | 2/11/16
2/13/16  | $435 | $727 | 2/12/16
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | 2/11/16
2/13/16  | $453 | $214 | 2/12/16

My goal would be a StoreBrand column which is the first value of COL1 for all date values before the next brand name:

COL1     | COL2 | COL3 | StoreBrand
Name 1   |      |      | Name 1
2/11/16  | $350 | $230 | Name 1
2/12/16  | $420 | $387 | Name 1
2/13/16  | $435 | $727 | Name 1
Name 2   |      |      | Name 2
2/11/16  | $121 | $144 | Name 2
2/12/16  | $243 | $658 | Name 2
2/13/16  | $453 | $214 | Name 2

The value of StoreBrand when Col2 and Col3 are null is inconsequential - that row will be dropped as part of the conversion process. The important thing is associating the data rows (i.e. those with dates) with their brand.

Is there a way to reference the previous value for the column that I'm missing?

2 answers:

Answer 0 (score: 1)

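Edit for people who find this question via a search engine:

The trick is to use WITH, which lets a temporary result be used in multiple places.

I think this does what you want and discards the empty rows at the same time (if you wish). We basically look at all the brand rows at or before the row we're currently on, and accept one only if there is no other "brand row" between it and the current row.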

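WITH t AS
   (SELECT
      -- No ORDER BY here, so the numbering is not guaranteed to follow the CSV order
      ROW_NUMBER() OVER () AS i,
      *
   FROM
      Staging_Data
   )
SELECT
   a.COL1,
   a.COL2,
   a.COL3,
   -- Nearest brand row at or before the current row: no other brand row lies in between
   (SELECT b.COL1 FROM t b WHERE b.COL2 IS NULL AND b.i <= a.i AND NOT EXISTS(
      SELECT * FROM t c WHERE c.COL2 IS NULL AND c.i <= a.i AND c.i > b.i)
   ) AS StoreBrand
FROM
   t a
WHERE -- I don't think you need the brand header rows; otherwise remove this filter.
   a.COL2 IS NOT NULL;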

This may be a bit confusing. t is the temporary table that we define with WITH from your query. a, b and c are aliases for t. We could also write FROM t AS a to make that more obvious.
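To try the correlated-subquery idea end to end, here is a minimal sketch that runs it against the sample data from the question. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres, and it assumes a hypothetical line_no column that records the original CSV row order (the staging table in the question has no such column). The function name brands_by_subquery is also just for illustration.

```python
import sqlite3

# Sample rows from the question; empty cells are loaded as NULL.
ROWS = [
    ("Name 1", None, None),
    ("2/11/16", "$350", "$230"),
    ("2/12/16", "$420", "$387"),
    ("2/13/16", "$435", "$727"),
    ("Name 2", None, None),
    ("2/11/16", "$121", "$144"),
    ("2/12/16", "$243", "$658"),
    ("2/13/16", "$453", "$214"),
]

# line_no is an assumed column (not in the question) that fixes the row order,
# replacing the unordered ROW_NUMBER() OVER () from the answer.
QUERY = """
WITH t AS (
    SELECT line_no AS i, COL1, COL2, COL3 FROM Staging_Data
)
SELECT a.COL1, a.COL2, a.COL3,
       (SELECT b.COL1
          FROM t AS b
         WHERE b.COL2 IS NULL
           AND b.i <= a.i
           AND NOT EXISTS (SELECT 1
                             FROM t AS c
                            WHERE c.COL2 IS NULL
                              AND c.i <= a.i
                              AND c.i > b.i)
       ) AS StoreBrand
  FROM t AS a
 WHERE a.COL2 IS NOT NULL
 ORDER BY a.i
"""


def brands_by_subquery():
    """Load the sample rows and return the data rows with their brand."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Staging_Data ("
        "line_no INTEGER PRIMARY KEY, COL1 TEXT, COL2 TEXT, COL3 TEXT)"
    )
    # INTEGER PRIMARY KEY auto-assigns 1, 2, 3, ... in insertion order.
    conn.executemany(
        "INSERT INTO Staging_Data (COL1, COL2, COL3) VALUES (?, ?, ?)", ROWS
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in brands_by_subquery():
        print(row)
```

The subquery is re-evaluated for every data row, so on a large staging table this is likely to be slower than a window-function approach; for a one-off migration that may not matter.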

Answer 1 (score: 0)

I think I understand what you want. Technically, you want the ignore nulls option on lag(), so it would look like this:

select lag(case when col1 not like '%/%/%' then col1 end ignore nulls) over (order by linenumber) as brandname

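The only problem? Postgres doesn't support ignore nulls.

However, you can do the same thing with a subquery. The idea is to assign a grouping identifier to each group of rows: a cumulative count of the valid brand names seen so far. Then a simple max() over each group does the job: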

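-- linenumber is assumed to be a column that records the original CSV row order
select t.*,
       max(case when col1 not like '%/%/%' then col1 end) over (partition by grp) as brand
from (select s.*,
             -- running count of brand rows: every row in a block gets the same grp
             sum(case when col1 not like '%/%/%' then 1 end) over
                 (order by linenumber) as grp
      from Staging_Data s
     ) t;  -- the subquery needs an alias for the outer t.* to resolve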
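As a rough check of the running-count idea, the sketch below runs the same two-step query on the question's sample data using Python's built-in sqlite3 (an in-memory stand-in for Postgres; window functions need SQLite 3.25 or newer). The line_no column plays the role of linenumber and is an assumption, since the staging table in the question has no ordering column; brands_by_running_count is a made-up name.

```python
import sqlite3

# Sample rows from the question; empty cells are loaded as NULL.
ROWS = [
    ("Name 1", None, None),
    ("2/11/16", "$350", "$230"),
    ("2/12/16", "$420", "$387"),
    ("2/13/16", "$435", "$727"),
    ("Name 2", None, None),
    ("2/11/16", "$121", "$144"),
    ("2/12/16", "$243", "$658"),
    ("2/13/16", "$453", "$214"),
]

# Inner query: grp is a running count of brand rows, so each brand row and
# the date rows that follow it share one grp value.
# Outer query: MAX() skips the NULLs produced for date rows, leaving the
# single brand name in each group.
QUERY = """
SELECT COL1, COL2, COL3,
       MAX(CASE WHEN COL1 NOT LIKE '%/%/%' THEN COL1 END)
           OVER (PARTITION BY grp) AS brand
  FROM (SELECT s.*,
               SUM(CASE WHEN COL1 NOT LIKE '%/%/%' THEN 1 END)
                   OVER (ORDER BY line_no) AS grp
          FROM Staging_Data AS s
       ) AS g
 ORDER BY line_no
"""


def brands_by_running_count():
    """Load the sample rows and return every row with its brand attached."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Staging_Data ("
        "line_no INTEGER PRIMARY KEY, COL1 TEXT, COL2 TEXT, COL3 TEXT)"
    )
    # INTEGER PRIMARY KEY auto-assigns 1, 2, 3, ... in insertion order.
    conn.executemany(
        "INSERT INTO Staging_Data (COL1, COL2, COL3) VALUES (?, ?, ?)", ROWS
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in brands_by_running_count():
        print(row)
```

Note that the '%/%/%' pattern treats anything with two slashes as a date, so a brand name containing two slashes would be misclassified; testing COL2 IS NULL, as in the question, is an alternative if that is a concern.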