Modifying a SQLite query with CTEs

Date: 2016-01-16 01:44:50

Tags: android sql sqlite common-table-expression recursive-cte

This is my query:

    WITH desc_table(counter, hourly, current_weather_description, current_icons, time_stamp) AS (
        Select count(*) AS counter,
               CASE WHEN strftime('%M', 'now') < '30'
                    THEN strftime('%H', 'now')
                    ELSE strftime('%H', time_stamp, '+1 hours') END as hourly,
               current_weather_description,
               current_icons,
               time_stamp
        From weather_events
        GROUP BY strftime('%H', time_stamp, '+30 minutes'), current_weather_description
        UNION ALL
        Select count(*) as counter, hourly - 1, current_weather_description, current_icons, time_stamp
        From weather_events
        GROUP BY strftime('%H', time_stamp, '+30 minutes'), current_weather_description
        Order By counter desc limit 1
    ),
    avg_temp_table(avg_temp, hour_seg, time_stamp) AS (
        select avg(current_temperatures) as avg_temp,
               CASE WHEN strftime('%M', time_stamp) < '30'
                    THEN strftime('%H', time_stamp)
                    ELSE strftime('%H', time_stamp, '+1 hours') END as hour_seg,
               time_stamp
        from weather_events
        group by strftime('%H', time_stamp, '+30 minutes')
        order by hour_seg desc
    )
    Select hourly, current_weather_description
    from desc_table
    join avg_temp_table
    on desc_table.hourly=avg_temp_table.hour_seg

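Basically, I have some weather data that I group into hourly intervals (offset by 30 minutes). I want to count how many times each weather description (and its matching icon) occurs within an interval, and select the description with the highest number of occurrences (count) in that interval (desc_table). Then I want to get the average temperature for the same period (avg_temp_table; maybe I need a subquery to do this avg instead of the way I am doing it?) and join the two queries on their hour columns.

I want my anchor to be based on the time the query is run (now) and to count the occurrences; each following member should then subtract one hour, move to the next interval, count again, and so on.

Sample data, with columns {current_temperatures, current_weather_description, current_icons, time_stamp}; a regular dataset would have more rows in each period: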

"87"    "Rain"  "rainicon"  "2016-01-20 02:15:08"
"65"    "Snow"  "snowicon"  "2016-01-20 02:39:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:15:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:39:08"
"46"    "Clear" "clearicon" "2016-01-20 04:15:29"
"46"    "Clear" "clearicon" "2016-01-20 04:38:53"
"46"    "Cloudy" "cloudyicon" "2016-01-20 05:15:08"
"46"    "Clear" "clearicon" "2016-01-20 05:39:08"
"45"    "Clear" "clearicon" "2016-01-20 06:14:17"
"45"    "Clear" "clearicon" "2016-01-20 06:34:23"
"45"    "Clear" "clearicon" "2016-01-20 07:24:54"
"45"    "Rain"  "rainicon"  "2016-01-20 07:44:41"
"43"    "Rain"  "rainicon"  "2016-01-20 08:19:08"
"36"    "Clear" "clearicon" "2016-01-20 08:39:08"
"35"    "Meatballs" "meatballsicon" "2016-01-20 09:18:08"
"18"    "Cloudy" "cloudyicon" "2016-01-20 09:39:08"

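The output is the join between the average temperature for each interval (avg_temp_table) and the output of the first aggregate CTE (desc_table), with columns {avg_temp, weather_description, current_icon}: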

"87"    "Rain"  "rainicon"
"57"    "Rain"  "rainicon"
"47"    "Clear" "clearicon"
"46"    "Clear" "clearicon"
"46"    "Cloudy" "cloudyicon"
"45"    "Clear" "clearicon"
"44"    "Rain"  "rainicon"
"36"    "Clear" "clearicon"
"18"    "Cloudy" "cloudyicon"

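Right now I get a "no such column" error, because my anchor selects from my weather_events table and so does my recursive member. When I change the recursive member to select from desc_table, I get a "recursive aggregate queries not supported" error. But I do not want the recursive member to read from desc_table; I want to segment by hour, then go through each hourly interval and get the counts. I suspect I am also setting up the anchor incorrectly.

1 answer:

Answer 0 (score: 7)

I am still not sure how your desc_table recursive CTE was supposed to pick the top weather description and its icon for each hour, but that is fine, because, going by your verbal description, I think I have figured out a way to do it without recursion.

First, group the results by hour and description and count the rows in each group: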

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  current_weather_description,
  current_icons,
  COUNT(*) AS event_count
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes'),
  current_weather_description

Next, group the results of the above query by hour and get the maximum event count per hour:

SELECT
  hour,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

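This is still not quite what you want, because you actually want the description and icon that match the maximum count, not the count itself. Well, that is easy to fix: just add those columns to the SELECT without adding them to the GROUP BY: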

SELECT
  hour,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

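You still need to keep MAX(event_count) in the query for the trick to work. The reason it works is that in SQLite, when a SELECT statement contains a single MAX or a single MIN call, the values of any selected columns that are neither in the GROUP BY nor inside an aggregate are taken from the row that matches said MAX or MIN value. This non-standard extension to SQL is documented in the release notes for SQLite 3.7.11.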
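The effect of the bare-column rule can be checked from Python's built-in sqlite3 module. The following is a minimal sketch, not part of the original answer: the rows are made up in the shape of the question's sample data, and the helper name most_common_per_hour is hypothetical. The counts are chosen so that no hour has a tie, because with a tie SQLite may take the bare columns from any of the tied rows.

```python
import sqlite3

# Made-up rows in the shape of the question's sample data:
# (current_temperatures, current_weather_description, current_icons, time_stamp)
ROWS = [
    (87, "Rain", "rainicon", "2016-01-20 02:15:08"),
    (45, "Clear", "clearicon", "2016-01-20 06:34:23"),
    (45, "Clear", "clearicon", "2016-01-20 07:24:54"),
    (42, "Rain", "rainicon", "2016-01-20 07:10:00"),
    (45, "Rain", "rainicon", "2016-01-20 07:44:41"),
    (43, "Rain", "rainicon", "2016-01-20 08:19:08"),
    (41, "Clear", "clearicon", "2016-01-20 08:05:00"),
]

# The query from the answer, with an ORDER BY added for a stable row order.
QUERY = """
SELECT hour, current_weather_description, current_icons,
       MAX(event_count) AS max_event_count
FROM (
    SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
           current_weather_description,
           current_icons,
           COUNT(*) AS event_count
    FROM weather_events
    GROUP BY strftime('%H', time_stamp, '+30 minutes'),
             current_weather_description
) AS s
GROUP BY hour
ORDER BY hour
"""


def most_common_per_hour(rows):
    """Return (hour, description, icon, count) for the top description per hour."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_events ("
        "current_temperatures INTEGER, current_weather_description TEXT, "
        "current_icons TEXT, time_stamp TEXT)"
    )
    conn.executemany("INSERT INTO weather_events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_common_per_hour(ROWS):
        print(row)
```

With these rows, hour 07 holds two "Clear" events and one "Rain" event, so the description and icon reported for that hour come from the "Clear" group.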

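So much for desc_table. As for the avg_temp_table CTE, there does not seem to be anything wrong with your current approach, except that I would probably use the GROUP BY expression as the hour definition instead of the CASE expression you are using, for consistency. The time_stamp column in the result also seems redundant. So the slightly modified CTE would look like this:

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  AVG(current_temperatures) AS avg_temp
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes')

Now you just need to join the two sets on the hour column and select the relevant columns for the final output:

SELECT
  t.avg_temp,
  d.current_weather_description,
  d.current_icons
FROM
  avg_temp_table AS t
  INNER JOIN desc_table AS d on t.hour = d.hour
ORDER BY
  t.hour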
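As a rough sketch of how these fragments could be assembled into a single statement, the two SELECTs can be wrapped as CTEs named desc_table and avg_temp_table and run through Python's sqlite3 module. The WITH wrapper is my assembly of the pieces described in this answer, and the sample rows and the function name hourly_summary are assumptions made for illustration.

```python
import sqlite3

# Made-up rows in the shape of the question's sample data:
# (current_temperatures, current_weather_description, current_icons, time_stamp)
ROWS = [
    (87, "Rain", "rainicon", "2016-01-20 02:15:08"),
    (45, "Clear", "clearicon", "2016-01-20 06:34:23"),
    (45, "Clear", "clearicon", "2016-01-20 07:24:54"),
    (42, "Rain", "rainicon", "2016-01-20 07:10:00"),
    (45, "Rain", "rainicon", "2016-01-20 07:44:41"),
    (43, "Rain", "rainicon", "2016-01-20 08:19:08"),
    (41, "Clear", "clearicon", "2016-01-20 08:05:00"),
]

QUERY = """
WITH desc_table AS (
    -- most frequent description (and its icon) per hour
    SELECT hour, current_weather_description, current_icons,
           MAX(event_count) AS max_event_count
    FROM (
        SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
               current_weather_description,
               current_icons,
               COUNT(*) AS event_count
        FROM weather_events
        GROUP BY strftime('%H', time_stamp, '+30 minutes'),
                 current_weather_description
    ) AS s
    GROUP BY hour
),
avg_temp_table AS (
    -- average temperature per hour
    SELECT strftime('%H', time_stamp, '+30 minutes') AS hour,
           AVG(current_temperatures) AS avg_temp
    FROM weather_events
    GROUP BY strftime('%H', time_stamp, '+30 minutes')
)
SELECT t.avg_temp, d.current_weather_description, d.current_icons
FROM avg_temp_table AS t
INNER JOIN desc_table AS d ON t.hour = d.hour
ORDER BY t.hour
"""


def hourly_summary(rows):
    """Return (avg_temp, description, icon) per hour, ordered by hour."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_events ("
        "current_temperatures INTEGER, current_weather_description TEXT, "
        "current_icons TEXT, time_stamp TEXT)"
    )
    conn.executemany("INSERT INTO weather_events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in hourly_summary(ROWS):
        print(row)
```

Note that AVG returns a floating-point value even when every input is an integer, so avg_temp comes back as a Python float.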

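So there you have it. Now I would like to address one question about the resulting query, namely:

Can the join be avoided?

While the solution you have adopted (getting the descriptions and the average temperatures separately and then joining the two sets) is simple and makes perfect sense, it would be nice to avoid the join and do all the calculations in one go. That would likely make the query faster, because the source would be scanned only once. Can this be achieved?

As it happens, yes, it can. The main difficulty in combining the two parts is that the descriptions are obtained in two steps, whereas the average temperature is calculated in a single step. Simply putting AVG(current_temperatures) into the nested SELECT of the first CTE (the one grouping by hour and description) and then applying AVG to its results in the outer SELECT (the one grouping by hour) is not mathematically equivalent to a single AVG over the entire hour group.

Instead, what you need to remember is that AVG = SUM / COUNT. If you get SUM and COUNT at the first step, and then get the SUM of the SUMs and the SUM of the COUNTs at the second step, you can divide the first outer SUM by the second outer SUM to get the average.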
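A small numeric check may make the difference concrete. This sketch uses Python's sqlite3 module with three hypothetical readings that all fall in one hour group (two "Clear", one "Rain"); the simplified two-column table and the function name compare_averages are assumptions for illustration only.

```python
import sqlite3

# Hypothetical readings within a single hour: (temperature, description)
READINGS = [(40, "Clear"), (50, "Clear"), (60, "Rain")]


def compare_averages(readings):
    """Return (direct AVG, AVG of per-description AVGs, SUM of SUMs / SUM of COUNTs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_events ("
        "current_temperatures INTEGER, current_weather_description TEXT)"
    )
    conn.executemany("INSERT INTO weather_events VALUES (?, ?)", readings)

    # One AVG over the whole group: (40 + 50 + 60) / 3
    direct = conn.execute(
        "SELECT AVG(current_temperatures) FROM weather_events"
    ).fetchone()[0]

    # AVG of the per-description AVGs: (45 + 60) / 2, which weights groups equally
    avg_of_avgs = conn.execute(
        "SELECT AVG(group_avg) FROM ("
        "  SELECT AVG(current_temperatures) AS group_avg"
        "  FROM weather_events GROUP BY current_weather_description)"
    ).fetchone()[0]

    # SUM of the SUMs divided by SUM of the COUNTs: (90 + 60) / (2 + 1)
    sum_over_count = conn.execute(
        "SELECT SUM(total_temp) / SUM(event_count) FROM ("
        "  SELECT SUM(current_temperatures) AS total_temp, COUNT(*) AS event_count"
        "  FROM weather_events GROUP BY current_weather_description)"
    ).fetchone()[0]

    conn.close()
    return direct, avg_of_avgs, sum_over_count


if __name__ == "__main__":
    print(compare_averages(READINGS))
```

Here the direct average and the SUM / SUM form agree on 50, while the average of averages gives 52.5, because the single "Rain" reading counts as much as the two "Clear" readings combined.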

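Here is the new desc_table CTE, modified to combine the two parts of the query (so it is no longer meant to be a CTE but a complete query). The key additions are the total_temp column in the nested SELECT and the avg_temp expression in the outer SELECT:

SELECT
  SUM(total_temp) / SUM(event_count) AS avg_temp,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count,
      SUM(current_temperatures) AS total_temp
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour
ORDER BY
  hour
;

Obviously, the max_event_count column is redundant for the output, yet it is still essential to the "greatest N per group" method the query relies on. Personally, I would not worry about one redundant column in this situation, but if you have a good reason to exclude it from the result set, you can use the above query as a derived table (yes, again) and have the outermost SELECT pull all the columns except max_event_count, for example like this:

SELECT
  avg_temp,
  current_weather_description,
  current_icons
FROM
  (
    SELECT
      hour,
      SUM(total_temp) / SUM(event_count) AS avg_temp,
      current_weather_description,
      current_icons,
      MAX(event_count) AS max_event_count
    FROM
      (
        SELECT
          strftime('%H', time_stamp, '+30 minutes') AS hour,
          current_weather_description,
          current_icons,
          COUNT(*) AS event_count,
          SUM(current_temperatures) AS total_temp
        FROM
          weather_events
        GROUP BY
          strftime('%H', time_stamp, '+30 minutes'),
          current_weather_description
      ) AS s
    GROUP BY
      hour
  ) AS s
ORDER BY
  hour desc
;

As you can see, the middle-level SELECT now also includes hour, which is needed for the outermost ORDER BY. (I am assuming here that the order matters to the calling application.)

I can mention just one difference between the results of the two methods. In the first, AVG(current_temperatures) gives you a floating-point result. In the second, SUM(total_temp) / SUM(event_count) gives an integer. Since your expected results show integer averages, I guess that should not be a problem. But if you later decide that you need more precision in the averages, keep in mind that you can replace the SUM function in SUM(total_temp) or SUM(current_temperatures) with the TOTAL function, which returns the same value as SUM but always as a real. Dividing a real by an integer yields a real in SQLite, so using TOTAL will get you the same results as AVG does in the first method.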
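The integer-versus-real behaviour can be seen with a short sketch using Python's sqlite3 module. The three readings, the simplified two-column table, and the helper name combined_average are hypothetical; the inner aggregate is switched between SUM and TOTAL while the rest of the two-level query stays the same.

```python
import sqlite3

# Hypothetical readings within a single hour: (temperature, description)
READINGS = [(40, "Clear"), (50, "Clear"), (70, "Rain")]


def combined_average(readings, inner_aggregate):
    """Compute SUM(total_temp) / SUM(event_count) with SUM or TOTAL as the inner aggregate."""
    if inner_aggregate not in ("SUM", "TOTAL"):
        raise ValueError("inner_aggregate must be 'SUM' or 'TOTAL'")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_events ("
        "current_temperatures INTEGER, current_weather_description TEXT)"
    )
    conn.executemany("INSERT INTO weather_events VALUES (?, ?)", readings)
    # The aggregate name comes from the fixed whitelist checked above.
    query = (
        "SELECT SUM(total_temp) / SUM(event_count) FROM ("
        f"  SELECT {inner_aggregate}(current_temperatures) AS total_temp,"
        "         COUNT(*) AS event_count"
        "  FROM weather_events GROUP BY current_weather_description)"
    )
    value = conn.execute(query).fetchone()[0]
    conn.close()
    return value


if __name__ == "__main__":
    print(combined_average(READINGS, "SUM"))    # integer division: 160 / 3
    print(combined_average(READINGS, "TOTAL"))  # real division: 160.0 / 3
```

With integer temperatures, the SUM variant truncates 160 / 3 to 53, whereas the TOTAL variant returns roughly 53.33, matching what AVG would give over the same rows.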