Most efficient way to join two time series

Asked: 2018-01-04 21:39:03

Tags: python sql postgresql amazon-redshift

Imagine I have a table like this:

    CREATE TABLE time_series (
        snapshot_date DATE,
        sales INTEGER,
        PRIMARY KEY (snapshot_date));

with values like this:

    INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date, 10 AS sales;
    INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date, 4 AS sales;
    INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date, 13 AS sales;
    INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date, 7 AS sales;
    INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date, 15 AS sales;
    INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date, 8 AS sales;

I want to be able to do this:

    SELECT a.snapshot_date,
           AVG(b.sales) AS sales_avg,
           COUNT(*) AS count
      FROM time_series AS a
      JOIN time_series AS b
           ON a.snapshot_date > b.snapshot_date
     GROUP BY a.snapshot_date

producing a result like this:

    *---------------*-----------*-------*
    | snapshot_date | sales_avg | count |
    *---------------*-----------*-------*
    |  2017-01-02   |   10.0    |    1  |
    |  2017-01-03   |   7.0     |    2  |
    |  2017-01-04   |   9.0     |    3  |
    |  2017-01-05   |   8.5     |    4  |
    |  2017-01-06   |   9.8     |    5  |
    -------------------------------------

With only a trivial number of rows, as in this example, the query runs very fast. The problem is that I have to do this over millions of rows, and on Redshift (whose syntax is similar to Postgres) the query takes days to run. It is extremely slow, yet this is one of my most common query patterns. I suspect the problem is that the work grows as O(n^2) in the size of the data rather than the preferable O(n).

An O(n) implementation in Python would make a single pass over the date-sorted rows, carrying a running sum and count, and would produce the same results as the SQL query.
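A minimal sketch of that single-pass O(n) approach (hypothetical code, not from the original post; it assumes the rows arrive already sorted by snapshot_date):

```python
# Sample data matching the INSERT statements above, sorted by date.
rows = [
    ("2017-01-01", 10),
    ("2017-01-02", 4),
    ("2017-01-03", 13),
    ("2017-01-04", 7),
    ("2017-01-05", 15),
    ("2017-01-06", 8),
]

def running_average(rows):
    """One pass over date-sorted rows, carrying a running sum and count.

    For each date, emit the average and count of sales on strictly
    earlier dates -- the same semantics as the self-join condition
    a.snapshot_date > b.snapshot_date.
    """
    result = []
    total, count = 0, 0
    for snapshot_date, sales in rows:
        if count > 0:  # the first date has no preceding rows, so skip it
            result.append((snapshot_date, total / count, count))
        total += sales
        count += 1
    return result

for snapshot_date, sales_avg, count in running_average(rows):
    print(snapshot_date, sales_avg, count)
```

This reproduces the result table above, e.g. 2017-01-05 averages the four earlier values (10+4+13+7)/4 = 8.5.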

I considered switching to Apache Spark so that I could implement that Python-style query, but a few million rows isn't really that big (3-4 GB at most), and using a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient, readable way to get O(n) efficiency in SQL, preferably in Postgres/Redshift?

1 Answer:

Answer 0 (score: 5)

You seem to want a running average and count over the preceding rows. You will find it much more efficient to use window functions.
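A sketch of what the window-function rewrite might look like (an assumed reconstruction, not necessarily the answerer's exact query; SQLite is used here only so the snippet runs standalone, and the same OVER clause works on Postgres -- verify frame-clause support on your Redshift version):

```python
import sqlite3

# Build the example table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE time_series (
                    snapshot_date DATE PRIMARY KEY,
                    sales INTEGER)""")
conn.executemany(
    "INSERT INTO time_series VALUES (?, ?)",
    [("2017-01-01", 10), ("2017-01-02", 4), ("2017-01-03", 13),
     ("2017-01-04", 7), ("2017-01-05", 15), ("2017-01-06", 8)],
)

# The frame ends at 1 PRECEDING, so each row aggregates strictly earlier
# rows -- one sort plus a single pass, instead of an O(n^2) self-join.
rows = conn.execute("""
    SELECT snapshot_date,
           AVG(sales) OVER (ORDER BY snapshot_date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
           COUNT(*)   OVER (ORDER BY snapshot_date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS cnt
      FROM time_series
     ORDER BY snapshot_date
""").fetchall()

for r in rows[1:]:  # the first date has no preceding rows (avg is NULL)
    print(r)
```

Unlike the GROUP BY self-join, this also emits the first date with a NULL average and a count of 0, which you can filter out if needed.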