Counting many different combinations of columns with multiple conditions

Time: 2016-06-08 01:56:37

Tags: sql postgresql subquery query-optimization amazon-redshift

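I have a table called fact_interactions that holds a running history of customer interactions. Every time a customer is contacted, a new record is created with the specific details of that interaction. Here is a sample: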

inter_id |customer_id |business_id |department_id |datetime_local      |outcome_id |
---------|------------|------------|--------------|--------------------|-----------|
46032383 |1           |112         |1916          |2015-01-14 19:54:20 |48         |
55740863 |2           |2           |3358          |2015-05-06 12:02:12 |19         |
49512895 |3           |160         |396           |2015-01-22 11:57:17 |19         |
51822751 |3           |160         |396           |2015-01-28 13:46:19 |19         |
23533190 |4           |132         |425           |2015-03-26 12:42:24 |19         |
69354240 |5           |164         |3061          |2015-03-30 11:01:43 |19         |
61417848 |5           |164         |3061          |2015-04-01 14:36:30 |19         |
74948424 |5           |164         |3061          |2015-04-28 15:12:42 |19         |
75303296 |5           |164         |3061          |2015-04-29 13:51:02 |10         |
76071776 |5           |164         |3061          |2015-05-01 09:18:39 |10         |

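For each record, I need to count all rows that match several conditions across several time windows. Here is a sample of the query, showing a few of the different subqueries I am using: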

SELECT
    inter_id,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id = A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamesame_day0,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamediff_day0,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id <> A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samediffdiff_day0
FROM fact_interactions A;

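In total I have 180 subqueries, one for each count I am trying to compute. So if fact_interactions has 1,000,000 records, the output will also have 1,000,000 records, each with inter_id plus 180 count columns. To explain further, here are some examples of how these 180 count columns are named:

  • cnt_samesamesame_day0 / day3 / day7 / ...
  • cnt_samesamediff_day0 / day3 / day7 / ...
  • cnt_samediffdiff_day0 / day3 / day7 / ...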

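The query does complete, but as you can imagine it takes a very long time. Computing cnt_samesamesame_day0 alone takes a minute.

It is hard to include a sample of the output because it is very sparse.

Any suggestions on how to do this more efficiently? Specific examples would be much appreciated, but even a better general approach would be great. Thanks!

(I am trying to implement this on an Amazon Redshift cluster.)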

1 answer:

Answer 0: (score: 1)

I would suggest looking into window functions. A count of earlier rows that match on a set of columns can be expressed as a ranking over a window partitioned by those columns, rather than as a correlated subquery evaluated for every row.

Other columns probably have similar constructions.
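
As a rough sketch of what such a rewrite could look like for the three day-0 columns from the question (this is an illustration, not the code from the original answer): RANK() minus 1 gives the number of rows in a partition with a strictly earlier datetime_local, which matches the strict `<` comparison in the subqueries even when timestamps tie. The "different" conditions are then obtained by subtracting ranks over narrower partitions. The CTE name and the r_* aliases are made up for this example, and the sketch assumes that customer_id, business_id, department_id and datetime_local are never NULL.

```sql
-- Sketch only. Assumes the key columns and datetime_local are NOT NULL;
-- with NULLs, PARTITION BY groups them together while "=" in the original
-- subqueries does not, so the counts would differ.
WITH ranked AS (
    SELECT
        inter_id,
        -- earlier rows for the same customer on the same day, plus 1
        RANK() OVER (PARTITION BY customer_id, datetime_local::date
                     ORDER BY datetime_local) AS r_cust,
        -- same customer and business
        RANK() OVER (PARTITION BY customer_id, business_id, datetime_local::date
                     ORDER BY datetime_local) AS r_cust_bus,
        -- same customer and department
        RANK() OVER (PARTITION BY customer_id, department_id, datetime_local::date
                     ORDER BY datetime_local) AS r_cust_dep,
        -- same customer, business and department
        RANK() OVER (PARTITION BY customer_id, business_id, department_id,
                                  datetime_local::date
                     ORDER BY datetime_local) AS r_cust_bus_dep
    FROM fact_interactions
)
SELECT
    inter_id,
    -- same business, same department
    r_cust_bus_dep - 1                            AS cnt_samesamesame_day0,
    -- same business, different department
    r_cust_bus - r_cust_bus_dep                   AS cnt_samesamediff_day0,
    -- different business AND different department (inclusion-exclusion)
    r_cust - r_cust_bus - r_cust_dep + r_cust_bus_dep AS cnt_samediffdiff_day0
FROM ranked;
```

This reads the table once instead of running one correlated subquery per row and per column. It only covers the same-calendar-day counts; the multi-day windows (day3, day7, ...) cannot be expressed by partitioning on the date and would need a different construction.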