Issue with PostgreSQL sum(case when count > 0) when partitioning by week of year and hour

Time: 2018-07-17 06:51:00

Tags: sql postgresql

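Using PostgreSQL version 9.4.18.

Below is a query that returns unexpected results for nonzero_year_count and percent_years_count_not_zero.

Table data: the real table covers 1988-2018, but for the sqlfiddle the test database only contains 2016-2018, as shown in the table below.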

CREATE TABLE ltg_data
("intensity" int, "time" timestamp with time zone, "lon" numeric, "lat" numeric);

INSERT INTO ltg_data ("intensity", "time", "lon", "lat")
VALUES
(200, '2018-06-23 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 17:19:00', -122.109, 42.9446),
(200, '2018-06-25 17:19:00', -122.109, 42.9446),
(200, '2017-06-25 19:19:00', -122.109, 42.9446),
(200, '2017-06-25 20:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-23 21:19:00', -122.109, 42.9446),
(200, '2017-06-23 21:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-25 07:19:00', -122.109, 42.9446),
(200, '2016-06-25 07:19:00', -122.109, 42.9446),
(200, '2016-06-27 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446);

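The query below is supposed to return some basic statistics about the table data. The challenge, I believe, is partitioning by week of year and hour while combining the years in some way. The incorrect data comes from the part of the query that tries to determine the number of years in which the count for a given week of year and hour is > 0. Here are the query and the functions it uses (f_woyhh normalizes the week of year by taking leap years into account). I'm using generate_series because I want values for the entire year, even for week/hour slots that have no counts.

Functions: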

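-- True for Gregorian leap years.
create or replace function IsLeapYear(int)
returns boolean as $$
    select $1 % 4 = 0 and ($1 % 100 <> 0 or $1 % 400 = 0)
$$ language sql immutable strict;

-- Week of year (7-day blocks counted from Jan 1, starting at 1) concatenated
-- with the two-digit hour, e.g. 2607 = week 26, hour 07.
-- In leap years, timestamps later than midnight on Feb 28 get an extra
-- one-day shift so that weeks line up across years.
create or replace function f_woyhh(timestamp with time zone)
returns int language plpgsql as $$
declare
    currentYear int = extract(year from $1);
    LeapYearShift int = 1 + (IsLeapYear(currentYear) and $1 > make_date(currentYear, 2, 28))::int;
begin
    return concat(((extract(doy from $1)::int) - LeapYearShift) / 7 + 1, to_char($1, 'HH24'));
end;
$$;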
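
As a quick sanity check of the week/hour key, the two calls below pass f_woyhh the same calendar date in a non-leap year and in a leap year. This is only a sketch: it assumes the two functions above are installed, and it pins the session time zone to UTC so the literals are read and displayed in the same zone (the time zone setting is an assumption, not part of the original setup).

```sql
SET TIME ZONE 'UTC';  -- assumption: any fixed zone works if used consistently

SELECT
    f_woyhh('2018-06-25 07:19:00') AS woyhh_2018   -- non-leap year: day of year 176
    ,f_woyhh('2016-06-25 07:19:00') AS woyhh_2016; -- leap year: day of year 177, shifted back by one
```

Both columns should come back as 2607, since (176 - 1) / 7 + 1 and (177 - 2) / 7 + 1 both evaluate to 26 under integer division.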

Query:

WITH CTE_Dates AS (
    -- full range of possible dates
    SELECT
        f_woyhh(d) AS dt
        ,EXTRACT(YEAR FROM d::timestamp) AS dtYear
    FROM generate_series(timestamp '2016-01-01', timestamp '2018-12-31', interval '1 hour') AS d
)
,CTE_WeeklyHourlyCounts AS (
    SELECT
        f_woyhh(time) AS dt
        ,time
        ,count(*) AS ct
    FROM ltg_data
    GROUP BY ltg_data.time
)
,CTE_FullStats AS (
    SELECT
        CTE_Dates.dt AS woyhh
        ,COUNT(DISTINCT CTE_Dates.dtYear) AS years_count
        ,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt) AS nonzero_year_count
        ,100.0 * SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt)
            / COUNT(DISTINCT CTE_Dates.dtYear) AS percent_years_count_not_zero
    FROM CTE_Dates
    LEFT JOIN CTE_WeeklyHourlyCounts ON CTE_WeeklyHourlyCounts.dt = CTE_Dates.dt
    GROUP BY CTE_Dates.dt, CTE_WeeklyHourlyCounts.ct, CTE_WeeklyHourlyCounts.dt
)
SELECT
    woyhh
    ,nonzero_year_count
    ,years_count
    ,percent_years_count_not_zero
FROM CTE_FullStats
WHERE woyhh::text LIKE '26%'
GROUP BY woyhh, years_count, nonzero_year_count, percent_years_count_not_zero
ORDER BY woyhh

Unexpected results:

woyhh | nonzero_year_count | years_count| percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 5                  | 3          | 200
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

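The part of the result that is wrong for 2607 is nonzero_year_count: it should be 3, since there are only 3 years of data and each of those years has a count for week 26, hour 07 (any day after the 24th of the month falls in week 26). Also, percent_years_count_not_zero should be 100%, not 200%; 100% is the maximum expected value for percent_years_count_not_zero.

Desired results: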
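
One way to see how nonzero_year_count can exceed the number of years is to run the inner aggregation on its own for a single slot. The sketch below reuses ltg_data and f_woyhh from this post; the yr column is added only for readability. It is a diagnostic, not a fix. Because CTE_WeeklyHourlyCounts groups by the exact timestamp, one year can contribute several rows to the same woyhh, and the outer query then appears to count one row per distinct ct value rather than one per year.

```sql
-- Rows that CTE_WeeklyHourlyCounts contributes to slot 2607:
-- one per distinct timestamp, so a year with strikes on several
-- days of week 26 shows up more than once.
SELECT
    f_woyhh("time")            AS woyhh
    ,EXTRACT(YEAR FROM "time") AS yr
    ,"time"
    ,count(*)                  AS ct
FROM ltg_data
WHERE f_woyhh("time") = 2607
GROUP BY "time"
ORDER BY "time";
```

If some year appears on more than one row here, the windowed SUM in CTE_FullStats is no longer bounded by years_count.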

woyhh | nonzero_year_count | years_count| percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 3                  | 3          | 100
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

So I think the main problem is in this part of the query:

,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER  (PARTITION BY CTE_Dates.dt) AS nonzero_year_count

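I am partitioning by the week/hour key, but that is not enough, because I also need to take the year into account. It feels like I need to group by year somehow, so that a year with at least one occurrence in a given week/hour slot is counted exactly once for that slot. I tried incorporating the year, but got even stranger results.

I hope this clarifies my question. I've added an updated sqlfiddle below that reproduces the data and query for the test table. Thanks for any help!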
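
A minimal sketch of the "count each year once" idea described above, offered as one possible direction rather than a confirmed answer. It assumes the ltg_data table and the f_woyhh function from this post; the CTE names cte_slots and cte_hits are made up for illustration. Instead of summing over per-timestamp rows, it counts the distinct years that have at least one row in each slot.

```sql
WITH cte_slots AS (
    -- every week/hour slot in the period, and how many years cover it
    SELECT
        f_woyhh(d) AS woyhh
        ,COUNT(DISTINCT EXTRACT(YEAR FROM d)) AS years_count
    FROM generate_series(timestamptz '2016-01-01', timestamptz '2018-12-31', interval '1 hour') AS d
    GROUP BY 1
)
,cte_hits AS (
    -- per slot, the number of distinct years with at least one strike
    SELECT
        f_woyhh("time") AS woyhh
        ,COUNT(DISTINCT EXTRACT(YEAR FROM "time")) AS nonzero_year_count
    FROM ltg_data
    GROUP BY 1
)
SELECT
    s.woyhh
    ,COALESCE(h.nonzero_year_count, 0) AS nonzero_year_count
    ,s.years_count
    ,100.0 * COALESCE(h.nonzero_year_count, 0) / s.years_count AS percent_years_count_not_zero
FROM cte_slots s
LEFT JOIN cte_hits h ON h.woyhh = s.woyhh   -- keeps slots with no strikes
WHERE s.woyhh::text LIKE '26%'
ORDER BY s.woyhh;
```

Because a year can appear at most once in COUNT(DISTINCT ...), nonzero_year_count cannot exceed years_count here, so the percentage stays at or below 100. The LEFT JOIN plus COALESCE keeps the empty slots that generate_series was introduced for.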

http://sqlfiddle.com/#!17/34289a/19

0 answers:

No answers yet.