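So I am working with the following PostgreSQL table:
For each business_id, I want to filter out those businesses whose review_count is not above a specific review_count threshold for 2 consecutive months (or rows). The threshold depends on the city the business_id is in (for instance, in the screenshot above, we can assume rows with city = Charlotte have a review_count threshold of >= 2, and rows with city = Las Vegas have a review_count threshold of >= 3). If a business_id does not have at least one instance of two consecutive months with review_counts at or above the specified threshold, I want to filter it out.
I want this query to return only the business_ids that meet this condition (along with all the other columns in the table that go with that business_id). The composite primary key on this table is (business_id, year, month).
As you may notice, some months are missing from the data (month 9 for the second business_id). If that is the case, I do NOT want to count the 2 adjacent rows as 2 consecutive months. For example, for the business in Las Vegas, I do not want to treat months 8 and 10 as consecutive months, even though they appear in consecutive rows.
I have tried something like this, but I have kind of hit a wall and don't think it is getting me much further: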
SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
FROM us_business_monthly_review_growth
GROUP BY business_id, year, month
HAVING (city = 'Las Vegas'
AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))
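I am new to Postgres and StackOverflow, so if you have any feedback on the way I asked this question, please don't hesitate to let me know! =)
UPDATE:
Thanks to some help from @Gordon Linoff, I found the following solution: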
SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT distinct(business_id)
FROM (SELECT *,
lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
FROM us_businesses_monthly_growth_and_avg
) AS usga
WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (YEAR * 12 + month) = (prev_year * 12 + prev_month) + 1)
OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (YEAR * 12 + month) = (prev_year * 12 + prev_month) + 1));
Answer 0 (score: 0)
You can use lag():
select distinct business_id
from (select t.*,
lag(year) over (partition by business_id order by year, month) as prev_year,
lag(month) over (partition by business_id order by year, month) as prev_month,
lag(review_count) over (partition by business_id order by year, month) as prev_review_count
from us_business_monthly_review_growth t
) t
where review_count >= $threshhold and prev_review_count >= $threshhold and
(year * 12 + month) = (prev_year * 12 + prev_month) + 1;
The only trick is setting the threshold. I don't know how you intend to do that.
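One possible way to supply the threshold, sketched under assumptions rather than taken from the question: keep the per-city thresholds in a small lookup relation and join on city. The names sample, city_thresholds and with_prev below are hypothetical, the rows are made up for illustration, and the threshold values 2 and 3 are just the example values mentioned in the question.

```sql
-- Hypothetical sample rows; in practice read from us_business_monthly_review_growth.
WITH sample (business_id, city, year, month, review_count) AS (
    VALUES ('b1', 'Charlotte', 2017,  7, 2),
           ('b1', 'Charlotte', 2017,  8, 3),
           ('b2', 'Las Vegas', 2017,  8, 4),
           ('b2', 'Las Vegas', 2017, 10, 5)   -- month 9 is missing
),
-- Assumed per-city thresholds (example values from the question).
city_thresholds (city, threshold) AS (
    VALUES ('Charlotte', 2),
           ('Las Vegas', 3)
),
-- Attach the previous row of the same business to every row.
with_prev AS (
    SELECT s.*,
           lag(year)         OVER w AS prev_year,
           lag(month)        OVER w AS prev_month,
           lag(review_count) OVER w AS prev_review_count
    FROM sample s
    WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
)
SELECT DISTINCT p.business_id
FROM with_prev p
JOIN city_thresholds ct ON ct.city = p.city
WHERE p.review_count      >= ct.threshold
  AND p.prev_review_count >= ct.threshold
  -- Only count the pair when the two rows are exactly one month apart.
  AND (p.year * 12 + p.month) = (p.prev_year * 12 + p.prev_month) + 1;
```

With these sample rows only b1 qualifies: both b2 rows are above the Las Vegas threshold, but they are two months apart, so the last condition rejects them. To get every column back, the result can be used in a WHERE business_id IN (...) filter on the original table.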
Answer 1 (score: 0)
Please try the following...
SELECT business_id
FROM
(
    SELECT business_id AS business_id,
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           -- Months between this row and the next one (1 means consecutive months).
           ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates,
           review_count AS review_count,
           -- The next row's count must also reach the threshold.
           LAG( review_count, -1 ) OVER ( ORDER BY business_id, year, month ) AS next_review_count
    FROM us_business_monthly_review_growth
    ORDER BY business_id,
             year,
             month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  AND tblCityThresholds.threshold <= review_count
  AND tblCityThresholds.threshold <= next_review_count
GROUP BY business_id;
In developing this answer, I first tested that LAG() performs as expected, using the following code...
SELECT business_id,
LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
year,
month,
LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
year,
month;
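Here I was trying to get LAG() to refer to the value from the next row, but the output showed that it refers to the previous row in that comparison. Unfortunately, I wanted to compare the current value with the next one, to see whether the next record has the same business_id, and so on. So I changed the 1 in LAG() to -1, which gave me...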
SELECT business_id,
LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
year,
month,
LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
year,
month;
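Since this gave me the results I wanted, I added city, to allow a JOIN between the results and an assumed table holding the details of each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion, since I am not sure what you have called it or will call it. This completes the inner SELECT statement.
I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output according to your criteria. Note: this assumes that the city field always has a corresponding entry in tblCityThresholds.
I then used GROUP BY to ensure that no business_id is repeated.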
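As a side note, PostgreSQL also provides LEAD(), which looks at the following row directly, so LAG( x, -1 ) can be written as LEAD( x ). The sketch below, using made-up sample rows and the hypothetical names sample and months_to_next, additionally partitions the window by business_id, which makes the separate comparison of business_id with the next row's business_id unnecessary.

```sql
-- Hypothetical sample rows, for illustration only.
WITH sample (business_id, year, month) AS (
    VALUES ('b1', 2017,  7),
           ('b1', 2017,  8),
           ('b2', 2017,  8),
           ('b2', 2017, 10)
)
SELECT business_id,
       year,
       month,
       -- Months until the next row of the same business; NULL on its last row.
       (lead(year) OVER w * 12 + lead(month) OVER w) - (year * 12 + month) AS months_to_next
FROM sample
WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
ORDER BY business_id, year, month;
```

For these rows months_to_next is 1 for b1 in month 7, 2 for b2 in month 8 (month 9 is missing), and NULL on the last row of each business, so filtering on months_to_next = 1 keeps only genuinely consecutive months.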
If you have any questions or comments, please feel free to post a comment.
Further Reading
https://www.postgresql.org/docs/8.4/static/functions-window.html (on LAG())