根据标准汇总文本

时间:2016-09-26 15:36:16

标签: sql postgresql aggregate-functions window-functions

previous question上,我问了一个类似的问题,这个问题依赖于辅助表作为分割数据标准的一部分。看来我目前的目标更容易,但我无法弄清楚。

鉴于表格:

CREATE TABLE conversations (id int, record_id int, is_response bool, text text);
INSERT INTO conversations VALUES
  (1,  1,  false, 'in text 1')
, (2,  1,  true , 'response text 1')
, (3,  1,  false, 'in text 2')
, (4,  1,  true , 'response text 2')
, (5,  1,  true , 'response text 3')
, (6,  2,  false, 'in text 1')
, (7,  2,  true , 'response text 1')
, (8,  2,  false, 'in text 2')
, (9,  2,  true , 'response text 2')
, (10, 2,  true , 'response text 3');

我想根据is_response值汇总文本并输出以下内容:

 record_id | aggregated_text                                   |
 ----------+---------------------------------------------------+
 1         |in text 1 response text 1                          |
 ----------+---------------------------------------------------+
 1         |in text 2 response text 2 response text 3          |
 ----------+---------------------------------------------------+
 2         |in text 1 response text 1                          |
 ----------+---------------------------------------------------+
 2         |in text 2 response text 2 response text 3          |

我已尝试过以下查询,但无法连续聚合两个响应,IE:is_response在序列中为真。

SELECT
    record_id,
    string_agg(text, ' ' ORDER BY id) AS aggregated_text
FROM (
    SELECT
        *,
        coalesce(sum(incl::integer) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) AS grp
    FROM (
        SELECT *, is_response as incl
        FROM conversations
         ) c
     ) c1
GROUP BY record_id, grp
HAVING bool_or(incl)
ORDER BY max(id);

我的查询输出只为以下is_response行添加了另一行,如下所示:

 record_id | aggregated_text                                   |
 ----------+---------------------------------------------------+
 1         |in text 1 response text 1                          |
 ----------+---------------------------------------------------+
 1         |in text 2 response text 2                          |
 ----------+---------------------------------------------------+
 1         |response text 3                                    |
 ----------+---------------------------------------------------+
 2         |in text 1 response text 1                          |
 ----------+---------------------------------------------------+
 2         |in text 2 response text 2                          |
 ----------+---------------------------------------------------+
 2         | response text 3                                   |
 ----------+---------------------------------------------------+

我该如何解决?

2 个答案:

答案 0 :(得分:1)

这基本上是your previous question的简单版本。

SELECT record_id, string_agg(text, ' ') As context
FROM  (
   SELECT *, count(NOT is_response OR NULL) OVER (PARTITION BY record_id ORDER BY id) AS grp
   FROM   conversations
   ORDER  BY record_id, id
   ) sub
GROUP  BY record_id, grp
ORDER  BY record_id, grp;

使用子查询中的单个窗口函数生成完全所需的结果,然后进行聚合。

我对上一个问题的回答中的详细说明和链接:

答案 1 :(得分:1)

以下是我在answer中提供的previous question的变体:

SELECT record_id, string_agg(text, ' ')
FROM (
    SELECT *, coalesce(sum(incl::integer) OVER w,0) AS subgrp
    FROM (
        SELECT *, is_response AND NOT coalesce(lead(is_response) OVER w,false) AS incl
        FROM conversations
        WINDOW w AS (PARTITION BY record_id ORDER BY id)
    ) t
    WINDOW w AS (PARTITION BY record_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
) t1
GROUP BY record_id, subgrp
HAVING bool_or(incl)
ORDER BY min(id);

我们的想法是,对于每一行,我们在lead窗口函数的帮助下查看同一记录的下一行。如果没有这样的行,或者如果有一行,并且当前is_response为真时其is_response为假,那么我们选择该行,汇总所有先前未使用的text值。

此查询还确保如果最后一次会话不完整(在您的示例数据中没有发生),则会被省略。