Getting the latest forecast data with multi-column group identifiers

Time: 2016-12-01 18:05:11

Tags: sql postgresql greatest-n-per-group

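I have many weather locations with wind forecast data. For each forecast I need the most recent as_of that is no later than 10:00 on the previous day. I need this for every forecast hour at every location.

A location is defined as a unique (lat, lon) pair.

Full table schema with relevant sample data: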

CREATE SCHEMA weather;

CREATE TABLE weather.forecast
  (
  foretime timestamp without time zone NOT NULL,
  as_of timestamp without time zone NOT NULL, -- in UTC
  summary text,
  precipintensity numeric(8,4),
  precipprob numeric(2,2),
  temperature numeric(5,2),
  apptemp numeric(5,2),
  dewpoint numeric(5,2),
  humidity numeric(2,2),
  windspeed numeric(5,2),
  windbearing numeric(4,1),
  visibility numeric(5,2),
  cloudcover numeric(4,2),
  pressure numeric(6,2),
  ozone numeric(5,2),
  preciptype text,
  lat numeric(8,6) NOT NULL,
  lon numeric(9,6) NOT NULL,
  CONSTRAINT forecast_pkey PRIMARY KEY (foretime, as_of, lat, lon)
  );

INSERT INTO weather.forecast
    (windspeed, foretime, as_of, lat, lon)
VALUES
  (11.19,   '2/1/2016 8:00',    '1/30/2016 23:00', 34.556, 28.345),
  (10.98,   '2/1/2016 8:00',    '1/31/2016 5:00', 34.556, 28.345),
  (10.64,   '2/1/2016 8:00',    '1/31/2016 11:00', 34.556, 28.345),
  (10.95,   '2/1/2016 8:00',    '1/31/2016 8:00', 29.114, 16.277),
  (10.39,   '2/1/2016 8:00',    '1/31/2016 23:00', 29.114, 16.277),
  (9.22,    '2/1/2016 8:00',    '1/31/2016 5:00', 29.114, 16.277),
  (10,      '2/1/2016 9:00',    '1/30/2016 04:00', 34.556, 28.345),
  (9.88,    '2/1/2016 9:00',    '1/31/2016 09:00', 34.556, 28.345),
  (10.79,   '2/1/2016 9:00',    '1/30/2016 23:00', 34.556, 28.345),
  (10.8,    '2/1/2016 9:00',    '1/31/2016 5:00', 29.114, 16.277),
  (10.35,   '2/1/2016 9:00',    '1/31/2016 11:00', 29.114, 16.277),
  (10.07,   '2/1/2016 9:00',    '1/31/2016 17:00', 29.114, 16.277)
;

Desired result format:

lat      lon     Foredate   foreHE   windspeed       as_of  
34.556  28.345   2/1/2016     8       10.98     1/31/2016 5:00
34.556  28.345   2/1/2016     9        9.88     1/31/2016 9:00  
29.114  16.277   2/1/2016     8       10.95     1/31/2016 8:00  
29.114  16.277   2/1/2016     9       10.80     1/31/2016 5:00 

Here is the code that gets the correct as_of. Things go wrong when I try to join the windspeed back in.

SELECT
    date_trunc('day', a.foretime)::date AS Foredate,
    extract(HOUR FROM a.foretime)       AS foreHE,
    a.lat,
    a.lon,
    max(a.as_of) - interval '5 hours'   AS latest_as_of
FROM weather.forecast a
WHERE date_trunc('day', a.foretime)::date - a.as_of >= interval '14 hours'
GROUP BY Foredate, foreHE, a.lat, a.lon;

1 answer:

Answer 0 (score: 2)

Your error, when adding back windspeed, is this:

[42803] ERROR: column "a.windspeed" must appear in the GROUP BY clause or be used in an aggregate function
  Position: 184

I can't really improve on PostgreSQL's error message, except maybe to get into the theory a bit. Basically, when you use GROUP BY you're affording yourself the luxury of acting over subsets within the larger set, which is the table represented by the rest of the query. But SQL doesn't hand you those subsets to iterate over; you must tell the database what to compute for each subset and let it give you back another flat list.

Of the two options presented by Postgres, usually one of them is an obvious choice. For instance, if you left out a.lon, it would be obvious that you're not grouping by longitude, just latitude, and you would agree that it should be added to the GROUP BY clause. But in this case, if you group by the actual measurement (windspeed), every subset will most likely have just one row, and that isn't useful either. So at first blush it seems like you need an aggregate. Problem number two is that there isn't a built-in aggregate that returns the windspeed from the row with the latest as_of. Sigh!
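As an aside, and not the approach this answer goes on to take: an ordered aggregate can stand in for the missing "value at the latest as_of" aggregate. The sketch below reuses the query from the question and assumes PostgreSQL 9.0 or later, where array_agg accepts an ORDER BY inside the call. It sorts each group's windspeeds by as_of, newest first, and keeps the first element.

```sql
-- Sketch only: emulate "windspeed at max(as_of)" with an ordered array_agg.
-- Assumes foretime always falls on the hour, so grouping by day + hour
-- is the same as grouping by foretime.
SELECT
    date_trunc('day', a.foretime)::date               AS foredate,
    extract(hour FROM a.foretime)                     AS forehe,
    a.lat,
    a.lon,
    -- windspeeds of the group sorted by as_of descending; [1] is the newest
    (array_agg(a.windspeed ORDER BY a.as_of DESC))[1] AS windspeed,
    max(a.as_of) - interval '5 hours'                 AS latest_as_of
FROM weather.forecast a
WHERE date_trunc('day', a.foretime)::date - a.as_of >= interval '14 hours'
GROUP BY foredate, forehe, a.lat, a.lon;
```

This builds an array per group, so it may be wasteful when groups are large; the join described next avoids that.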

So here's my thinking. The primary key you need to look up by is (foretime, as_of, lat, lon), and you can get that straightforwardly with this query:

select
  foretime,
  max(as_of) as as_of,
  lat, lon
from weather.forecast
group by foretime, lat, lon;

Now you could join this back onto the same table, forecast, to get the most recent forecasts:

select
  date_trunc('day', a.foretime)::date as forecast_day,
  extract(hour from a.foretime) as forecast_hour,
  a.lat, a.lon,
  f.windspeed,
  a.as_of - interval '5 hours' as latest_as_of
from weather.forecast f
join (select
        foretime,
        max(as_of) as as_of,
        lat, lon
      from weather.forecast
      group by foretime, lat, lon) a using (foretime, as_of, lat, lon);

This produces the following report:

 forecast_day | forecast_hour |    lat    |    lon    | windspeed |    latest_as_of
--------------+---------------+-----------+-----------+-----------+---------------------
 2016-02-01   |             8 | 34.556000 | 28.345000 |     10.64 | 2016-01-31 06:00:00
 2016-02-01   |             8 | 29.114000 | 16.277000 |     10.39 | 2016-01-31 18:00:00
 2016-02-01   |             9 | 34.556000 | 28.345000 |      9.88 | 2016-01-31 04:00:00
 2016-02-01   |             9 | 29.114000 | 16.277000 |     10.07 | 2016-01-31 12:00:00
(4 rows)

There is probably a more efficient way to do this with correlated subqueries but I'm not sure how to accomplish it.
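One candidate, offered as a sketch rather than a measured improvement, is PostgreSQL's DISTINCT ON, which keeps the first row of each group according to the ORDER BY. Sorting each (foretime, lat, lon) group by as_of descending means the kept row is the most recent forecast, so no self-join is needed. The column aliases mirror the query above; whether it is actually faster depends on the data and indexes, which is not tested here.

```sql
-- Sketch only: latest forecast per (foretime, lat, lon) using DISTINCT ON.
-- DISTINCT ON is a PostgreSQL extension, not standard SQL.
SELECT DISTINCT ON (foretime, lat, lon)
    date_trunc('day', foretime)::date AS forecast_day,
    extract(hour FROM foretime)       AS forecast_hour,
    lat,
    lon,
    windspeed,
    as_of - interval '5 hours'        AS latest_as_of
FROM weather.forecast
-- The leading ORDER BY expressions must match the DISTINCT ON list;
-- as_of DESC then decides which row of each group is kept.
ORDER BY foretime, lat, lon, as_of DESC;
```

Because (foretime, as_of, lat, lon) is the primary key, there are no ties on as_of within a group, so this should select the same rows as the join above, though possibly in a different order. A cutoff on as_of can be added as an ordinary WHERE clause.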

Edit: Matching your output format:

select
  a.lat, a.lon,
  date_trunc('day', a.foretime)::date as forecast_day,
  extract(hour from a.foretime) as forecast_hour,
  f.windspeed,
  a.as_of - interval '5 hours' as latest_as_of
from weather.forecast f
  join (select
          foretime,
          max(as_of) as as_of,
          lat, lon
        from weather.forecast
        where date_trunc('day', foretime)::date - as_of >= interval '14 hours'
        group by foretime, lat, lon) a using (foretime, as_of, lat, lon)
order by lat desc, lon;

Result:

    lat    |    lon    | forecast_day | forecast_hour | windspeed |    latest_as_of
-----------+-----------+--------------+---------------+-----------+---------------------
 34.556000 | 28.345000 | 2016-02-01   |             8 |     10.98 | 2016-01-31 00:00:00
 34.556000 | 28.345000 | 2016-02-01   |             9 |      9.88 | 2016-01-31 04:00:00
 29.114000 | 16.277000 | 2016-02-01   |             8 |     10.95 | 2016-01-31 03:00:00
 29.114000 | 16.277000 | 2016-02-01   |             9 |     10.80 | 2016-01-31 00:00:00
(4 rows)
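
The WHERE clause in the edited query is what enforces the 10:00 cutoff: midnight of the forecast day minus 14 hours is 10:00 on the previous day. The sketch below works that out for one forecast hour from the sample data and then shows the predicate rewritten with as_of isolated on one side, which should be equivalent and may be easier to read. The as_of_cutoff alias is made up for illustration.

```sql
-- Worked check of the cutoff for a single forecast hour.
SELECT date_trunc('day', timestamp '2016-02-01 08:00')
         - interval '14 hours' AS as_of_cutoff;
-- as_of_cutoff: 2016-01-31 10:00:00

-- Same filter as in the edited query, with as_of on its own side
-- (assumed to be an equivalent rewrite).
SELECT foretime, as_of, lat, lon, windspeed
FROM weather.forecast
WHERE as_of <= date_trunc('day', foretime) - interval '14 hours';
```

Note that the cutoff is compared against the stored as_of, which the schema comment says is in UTC, while the reports subtract 5 hours only for display.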