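Fill in rows for missing dates in a table (PostgreSQL, Redshift)

Asked: 2016-06-19 09:43:49

Tags: postgresql amazon-web-services amazon-redshift gaps-in-data

I am trying to fill in daily data for missing dates and cannot find an answer. Please help.

A sample of my daily_table: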

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

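Expected result: I want to fill this table with a row for every day for each URL, where each added day simply copies the data from the preceding date: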

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

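I could move part of the logic into PHP, but that is undesirable because my table has billions of missing dates.

Summary:

Over the last few days I have found out that:

  1. Amazon Redshift is based on version 8 of PostgreSQL, which is why it does not support nice commands such as JOIN LATERAL
  2. Redshift also does not support generate_series or CTEs
  3. It does, however, support a simple WITH (thanks, @systemjack), but not WITH RECURSIVE

4 answers:

Answer 0: (score: 2)

Another solution, avoiding all the "modern" features ;-]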

Answer 1: (score: 1)

Here is the idea behind the query:

select distinct on (domain, new_date) *
from (
    select new_date::date 
    from generate_series('2016-04-12', '2016-04-17', '1d'::interval) new_date
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

  new_date  |     domain      |    date    | visitors | hits  
------------+-----------------+------------+----------+-------
 2016-04-12 | www.domain1.com | 2016-04-12 |     1231 | 23423
 2016-04-13 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-14 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-15 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-16 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-17 | www.domain1.com | 2016-04-17 |     1262 | 21493
(6 rows)

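You have to choose the start and end dates to suit your requirements. The query can be very expensive (you mentioned billions of gaps), so apply it with care (test it on a smaller subset of the data, or run it in stages).

If generate_series() is not available, you can create your own generator. Here is an interesting example. The view from the cited article can be used in place of generate_series(). For example, if you need the period '2016-04-12' + 5 days: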

select distinct on (domain, new_date) *
from (
    select '2016-04-12'::date+ n new_date
    from generator_16
    where n < 6
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

You will get the same result as in the first example.
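
To make the logic of the two queries above easier to follow, here is a minimal Python sketch of the same idea: build the list of days, then for each domain and day keep the latest row whose date is on or before that day (the role played by `distinct on ... order by date desc`). The function name `fill_gaps` and the in-memory sample rows are assumptions for illustration only. Unlike the left join, this sketch simply skips days that precede a domain's first row.

```python
from datetime import date, timedelta

# Sample rows mirroring the answer's a_table: (domain, date, visitors, hits).
rows = [
    ("www.domain1.com", date(2016, 4, 12), 1231, 23423),
    ("www.domain1.com", date(2016, 4, 13), 1374, 26482),
    ("www.domain1.com", date(2016, 4, 17), 1262, 21493),
]

def fill_gaps(rows, start, end):
    """For each domain and each day in [start, end], pick the latest row with date <= day."""
    # Stand-in for generate_series(start, end, '1d').
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    domains = sorted({r[0] for r in rows})
    result = []
    for domain in domains:
        for day in days:
            # Join condition: date <= new_date, restricted to one domain.
            candidates = [r for r in rows if r[0] == domain and r[1] <= day]
            if candidates:
                # "order by date desc" plus "distinct on" keeps the most recent row.
                latest = max(candidates, key=lambda r: r[1])
                result.append((day, *latest))
    return result

filled = fill_gaps(rows, date(2016, 4, 12), date(2016, 4, 17))
for row in filled:
    print(row)
```

The nested loops make the cost visible: every day is compared with every earlier row of the same domain, which is why the answer warns that the query can be expensive.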

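Answer 2: (score: 1)

This is an ugly hack to get Redshift to generate new rows in a table, dates in this case. This example limits the output to the previous 30 days. The range can be adjusted or removed. The same approach can also be used for minutes, seconds, and so on.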

with days as (
    select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
            from stv_blocklist limit 30
)
select day from days order by day

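To target a specific time range, change sysdate to a literal for the last day of the range you want, and set the limit to the number of days to cover.

The insert would look something like this: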

insert into your_table (domain, date) (
    -- the WITH clause is placed inside the inserted query rather than before INSERT
    with days as (
        select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
                from stv_blocklist limit 30
    )
    select dns.domain, d.day
    from days d
    cross join (select distinct(domain) from your_table) dns
    -- anti-join: keep only (domain, day) pairs that have no row yet
    left join your_table y on y.domain=dns.domain and y.date=d.day
    where y.date is null
)

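I was not able to test the insert, so it may need some adjustment.

The reference to the stv_blocklist table could be any table with enough rows to cover the range limit in the with clause; it is only there to seed the row_number() window function.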
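
As a rough illustration of what the `days` CTE and the insert compute, here is a small Python model. It is not Redshift code: `last_n_days` and `missing_pairs` are hypothetical helper names, and the sample pairs are made up. Row numbers 1..n are subtracted from the day after the anchor date, and the anti-join keeps only the (domain, day) pairs that do not exist yet.

```python
from datetime import date, timedelta

def last_n_days(anchor, n):
    """Model of the `days` CTE: row numbers 1..n subtracted from anchor + 1 day."""
    end_exclusive = anchor + timedelta(days=1)
    return sorted(end_exclusive - timedelta(days=k) for k in range(1, n + 1))

def missing_pairs(existing, days):
    """Model of the cross join plus anti-join: (domain, day) pairs with no row yet."""
    domains = sorted({domain for domain, _ in existing})
    return [(dom, day) for dom in domains for day in days if (dom, day) not in existing]

# Hypothetical sample: (domain, date) pairs already present in your_table.
existing = {
    ("www.domain.com/1", date(2016, 4, 13)),
    ("www.domain.com/1", date(2016, 4, 17)),
}
days = last_n_days(date(2016, 4, 17), 5)  # 2016-04-13 .. 2016-04-17
to_insert = missing_pairs(existing, days)
print(to_insert)
```

In the SQL version the seed table only has to supply at least as many rows as the limit; its contents are never read.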

If you end up with rows that contain only a date, you can update them with the most recent complete record like this:

update your_table set visitors=t.visitors, hits=t.hits
from (
    select a.domain, a.date, b.visitors, b.hits
    from your_table a
    inner join your_table b
        on b.domain=a.domain and b.date=(SELECT max(date) FROM your_table where domain=a.domain and hits is not null and date < a.date)
    where a.hits is null
) t
where your_table.domain=t.domain and your_table.date=t.date

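This is slow, but for a smaller data set or a one-off job it should be fine. I was able to test a similar query.

Update: I think this version of the query for filling in the nulls should perform better, and it takes both domain and date into account. I tested a similar version.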

update your_table set visitors=t.prev_visitors, hits=t.prev_hits
from (
    select domain, date, hits,
        -- previous non-null value within the same domain, ordered by date
        lag(visitors,1) ignore nulls over (partition by domain order by date) as prev_visitors,
        lag(hits,1) ignore nulls over (partition by domain order by date) as prev_hits
    from your_table
) t
where t.hits is null and your_table.domain=t.domain and your_table.date=t.date

It should be possible to combine this with the initial populate query and do everything in one pass.
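
For readers unfamiliar with `lag(...) ignore nulls`, the following Python sketch models what the update above does: within each domain, ordered by date, a row whose hits is null receives the previous non-null visitors and hits. The function name `forward_fill` and the dict-based rows are assumptions for illustration; this models the logic only and is not how Redshift executes it.

```python
from datetime import date

def forward_fill(rows):
    """Fill rows whose hits is None with the previous non-null values per domain."""
    last = {}  # domain -> last non-null value seen for each column
    filled = []
    for row in sorted(rows, key=lambda r: (r["domain"], r["date"])):
        prev = last.setdefault(row["domain"], {})
        out = dict(row)
        if row["hits"] is None:
            # Same as: set visitors = prev_visitors, hits = prev_hits where hits is null.
            out["visitors"] = prev.get("visitors")
            out["hits"] = prev.get("hits")
        # "ignore nulls": only non-null values are remembered for later rows.
        for col in ("visitors", "hits"):
            if row[col] is not None:
                prev[col] = row[col]
        filled.append(out)
    return filled

# Hypothetical sample: real rows plus date-only placeholder rows.
sample = [
    {"domain": "www.domain.com/1", "date": date(2016, 4, 13), "visitors": 1374, "hits": 26482},
    {"domain": "www.domain.com/1", "date": date(2016, 4, 14), "visitors": None, "hits": None},
    {"domain": "www.domain.com/1", "date": date(2016, 4, 15), "visitors": None, "hits": None},
    {"domain": "www.domain.com/1", "date": date(2016, 4, 17), "visitors": 1262, "hits": 21493},
    {"domain": "www.domain.com/2", "date": date(2016, 4, 14), "visitors": None, "hits": None},
]
result = forward_fill(sample)
for r in result:
    print(r)
```

A placeholder row with no earlier data in its domain stays null, which matches what the update would write in that case.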

Answer 3: (score: 0)

I finally finished my task, and I want to share a few useful things.

Instead of generate_series I used this hack:

WITH date_range AS (
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- any table of yours that has enough rows
  LIMIT 365
) SELECT * FROM date_range;

To get the list of URLs whose data I had to fill in, I used:

WITH url_list AS (
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  WHERE url IN (
    SELECT url FROM daily_table GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
  )
  GROUP BY url
) SELECT * FROM url_list;
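
The HAVING condition in the query above is the key step: a URL has gaps when it has fewer rows than there are days between its first and last date, inclusive. A minimal Python sketch of that check, using the sample data from the question (the function name `urls_with_gaps` is an assumption for illustration):

```python
from collections import defaultdict
from datetime import date

def urls_with_gaps(rows):
    """Return {url: (min_date, max_date)} for URLs with fewer rows than days in their span."""
    dates_by_url = defaultdict(list)
    for url, day in rows:
        dates_by_url[url].append(day)
    gapped = {}
    for url, days in dates_by_url.items():
        first, last = min(days), max(days)
        # Same test as: count(url) < MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1
        if len(days) < (last - first).days + 1:
            gapped[url] = (first, last)
    return gapped

# (url, timestamp_gmt) pairs from the question's daily_table sample.
rows = [
    ("www.domain.com/1", date(2016, 4, 12)),
    ("www.domain.com/1", date(2016, 4, 13)),
    ("www.domain.com/1", date(2016, 4, 17)),
    ("www.domain.com/2", date(2016, 5, 9)),
]
gaps = urls_with_gaps(rows)
print(gaps)
```

Like `count(url)`, this counts rows rather than distinct dates, so it assumes at most one row per URL per day.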

Then I combined the two result sets; let's call the result url_mapping:

SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;

To get the data for the nearest earlier date, I did the following:

SELECT sd.*
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
  sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
)

I hope this helps someone.