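I have an MS SQL Server 2008 database in which I store places that serve food (cafés, restaurants, diners, etc.). On a website connected to this database, people can rate the places on a scale from 1 to 3.
The website has a page where people can view a top list of the 25 highest-rated places in a given city. The database structure looks something like this (more information is stored in the tables, but this is the relevant part):
A place is located in a city, and a vote is cast on a place.
So far I have simply calculated an average vote score for each place: I divide the sum of all votes for a place by the number of votes for that place, like this (pseudocode):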
vote_count = total number of votes for the place
vote_sum = total sum of all the votes for the place
vote_score = vote_sum/vote_count
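I also have to handle division by zero when a place has no votes. All of this is done in a stored procedure that also fetches the other data I want to show in the top list. Here is the current stored procedure, which fetches the 25 places with the highest vote score: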
ALTER PROCEDURE [dbo].[GetTopListByCity]
(
@city_id Int
)
AS
SELECT TOP 25 dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
ISNULL(SUM(dbo.Votes.vote_score),0) AS vote_sum,
(SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id) AS vote_count,
COALESCE((CONVERT(FLOAT,SUM(dbo.Votes.vote_score))/(CONVERT(FLOAT,(SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id)))),0) AS vote_score
FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT OUTER JOIN dbo.Votes ON dbo.Places.place_id = dbo.Votes.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng
ORDER BY vote_score DESC, vote_count DESC, place_name ASC
RETURN
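As you can see, it fetches more than just the vote score: I also need data about the place, the city it is in, and so on. This works fine, but there is one big problem: the vote score is too simple, because it does not take the number of votes into account. With the simple calculation, a place with a single vote of 3 ends up higher in the list than a place with fourteen votes of 3 and one vote of 2:
3/1 = 3
(14*3 + 1*2)/15 = 44/15 = 2.933333333333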
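The ranking problem can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `single_vote` and `many_votes` are made up for the example and are not part of the database.

```python
def average(votes):
    """Plain average of a non-empty list of votes."""
    return sum(votes) / len(votes)

# Hypothetical places matching the example in the text.
single_vote = [3]             # one vote of 3
many_votes = [3] * 14 + [2]   # fourteen votes of 3, one vote of 2

# Sort by plain average, highest first, as the current top list does.
ranked = sorted(
    [("many_votes", many_votes), ("single_vote", single_vote)],
    key=lambda place: average(place[1]),
    reverse=True,
)

for name, votes in ranked:
    print(name, len(votes), average(votes))
```

The place with a single vote sorts first even though the other place has fifteen votes behind its score.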
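To fix this, I have been looking into using some form of weighted average / weighted index. I found an example of a true Bayesian estimate that looks promising. It looks like this: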
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the place (mean) = (Rating)
v = number of votes for the place = (votes)
m = minimum number of votes required to be listed in the Top 25 (unsure how many, but somewhere between 2-5 seems realistic)
C = the mean vote across the whole database
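Written as a function, the formula above is short. The sketch below is in Python rather than T-SQL, and the values m = 3 and C = 2.5 are assumptions chosen only to illustrate the effect; they are not taken from the database.

```python
def weighted_rating(R, v, m, C):
    """WR = (v / (v + m)) * R + (m / (v + m)) * C.

    R: average vote for the place, v: number of votes for the place,
    m: minimum-votes parameter (must be > 0), C: mean vote over all places.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed, illustrative parameters (not from the original post).
m, C = 3, 2.5

single = weighted_rating(R=3.0, v=1, m=m, C=C)      # one vote of 3
many = weighted_rating(R=44 / 15, v=15, m=m, C=C)   # fourteen 3s and one 2

print(single, many)
```

Under these assumptions the place with fifteen votes now ranks above the place with a single vote, because a small v pulls the result toward C. With v = 0 the function simply returns C.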
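The problems start when I try to implement this weighted rating in the stored procedure: it quickly gets complicated, I get tangled up in parentheses, and I lose track of what the stored procedure is doing.
Now I need some help with two questions:
Is this a suitable method for calculating a weighted index for my site?
What would this (or another suitable method) look like when implemented in a stored procedure?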
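Answer 0 (score: 1)
I can't see anything wrong with your calculation, but I can see that you are doing the same thing several times. My suggestion is to do the aggregation in one place, which makes the final SELECT much simpler.
;WITH CTE AS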
(
SELECT
SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
COUNT(*) AS CountOfVotes,
Votes.place_id
FROM
Votes
GROUP BY
Votes.place_id
)
SELECT TOP 25
dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
CTE.CountOfVotes AS vote_count,
COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
(CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score
FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id=CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
CTE.SumOfVoteScore,
CTE.CountOfVotes
ORDER BY vote_score DESC, vote_count DESC, place_name ASC
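The CTE lets us reuse the calculation, so we don't have to repeat SUM(vote_score) and SELECT COUNT(*) FROM Votes WHERE... several times. The final SELECT that uses the calculated values is then easy to follow.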
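To see the aggregate-once-then-join pattern in isolation, here is a small runnable sketch. It uses Python's built-in sqlite3 module instead of SQL Server, so the dialect differs (no TOP, CAST instead of CONVERT), and the three places and their votes are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Places (place_id INTEGER PRIMARY KEY, place_name TEXT);
CREATE TABLE Votes (place_id INTEGER, vote_score INTEGER);
-- Made-up sample data: Bistro C has no votes at all.
INSERT INTO Places VALUES (1, 'Cafe A'), (2, 'Diner B'), (3, 'Bistro C');
INSERT INTO Votes VALUES (1, 3), (1, 2), (2, 3);
""")

rows = conn.execute("""
WITH CTE AS (
    -- Aggregate the votes once per place.
    SELECT place_id,
           SUM(vote_score) AS SumOfVoteScore,
           COUNT(*)        AS CountOfVotes
    FROM Votes
    GROUP BY place_id
)
SELECT p.place_name,
       COALESCE(c.SumOfVoteScore, 0) AS vote_sum,
       COALESCE(c.CountOfVotes, 0)   AS vote_count,
       COALESCE(CAST(c.SumOfVoteScore AS REAL) / c.CountOfVotes, 0) AS vote_score
FROM Places AS p
LEFT JOIN CTE AS c ON c.place_id = p.place_id
ORDER BY vote_score DESC, vote_count DESC, p.place_name ASC
""").fetchall()

for row in rows:
    print(row)
```

Because the CTE already returns one row per place, the outer query needs no GROUP BY of its own, and the LEFT JOIN keeps places that have no votes.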
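I hope this helps.
Edit
You don't have to define the column list in the CTE. CTE (SumOfVoteScore, CountOfVotes, place_id) AS works just as well as CTE AS here, because every column in the inner SELECT already has a name. An explicit column list is more commonly seen with recursive CTEs, where the second part of the definition is a UNION ALL.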
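Answer 1 (score: 0)
Thanks, Arion!
I had been looking for something like CTEs, I just didn't know that was what I was looking for! It's always good to learn something new, and I know I will use CTEs in other projects. When I implemented your CTE in my stored procedure, I ended up with this code: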
ALTER PROCEDURE dbo.GetTopListByCityCTE
(
@city_id Int
)
AS
;WITH CTE (SumOfVoteScore, CountOfVotes, place_id) AS
(
SELECT
SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
COUNT(*) AS CountOfVotes,
Votes.place_id
FROM
Votes
GROUP BY
Votes.place_id
)
SELECT TOP 25
dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
CTE.CountOfVotes AS vote_count,
COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
(CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score
FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
CTE.SumOfVoteScore,
CTE.CountOfVotes
ORDER BY vote_score DESC, vote_count DESC, place_name ASC
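A quick check shows that it returns the same results as the previous code, but it is much easier to read and follow, and hopefully more efficient.
Now I'll have to experiment a bit to replace it with a new (simple) rating calculation that takes the number of votes into account.
Answer 2 (score: 0)
OK, so here is the stored procedure I came up with: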
ALTER PROCEDURE dbo.GetTopListByCityCTE
(
@city_id Int
)
AS
DECLARE @MinimumNumber float;
DECLARE @TotalNumberOfVotes int;
DECLARE @AverageRating float;
DECLARE @AverageNumberOfVotes float;
/* MINIMUM NUMBER */
SET @MinimumNumber = 1;
/* TOTAL NUMBER OF VOTES -- ALL PLACES */
SET @TotalNumberOfVotes = (
SELECT COUNT(*) FROM Votes
);
/* AVERAGE RATING -- ALL PLACES */
SET @AverageRating = (
SELECT
CONVERT(FLOAT,(SUM(dbo.Votes.vote_score))) / CONVERT(FLOAT,COUNT(*)) AS AverageRating
FROM
Votes);
/* AVERAGE NUMBER OF VOTES -- ALL PLACES */
/* CURRENTLY NOT USED IN INDEX - KEPT FOR REFERENCE */
SET @AverageNumberOfVotes = (
SELECT AVG(CONVERT(FLOAT,NumberOfVotes)) FROM (SELECT COUNT(*) AS NumberOfVotes FROM Votes GROUP BY place_id) AS AverageNumberOfVotes
);
/* SUM OF ALL VOTE SCORES AND COUNT OF ALL VOTES -- INDIVIDUAL PLACES */
WITH CTE AS (
SELECT
CONVERT(FLOAT, SUM(dbo.Votes.vote_score)) AS SumVotesForPlace,
CONVERT(FLOAT, COUNT(*)) AS CountVotesForPlace,
Votes.place_id
FROM
Votes
GROUP BY
Votes.place_id
)
SELECT
dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
ISNULL(CTE.SumVotesForPlace,0) AS vote_sum,
ISNULL(CTE.CountVotesForPlace,0) AS vote_count,
COALESCE((CTE.SumVotesForPlace/
CTE.CountVotesForPlace),0) AS vote_score,
ISNULL((CTE.CountVotesForPlace / (CTE.CountVotesForPlace + @MinimumNumber)) * (COALESCE((CTE.SumVotesForPlace / CTE.CountVotesForPlace),0)) + (@MinimumNumber / (CTE.CountVotesForPlace + @MinimumNumber)) * @AverageRating,0) AS WeightedIndex
FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
dbo.Places.city_id,
dbo.Places.place_name,
dbo.Places.place_alias,
dbo.Places.place_street_address,
dbo.Places.place_street_number,
dbo.Places.place_zip_code,
dbo.Cities.city_name,
dbo.Cities.city_alias,
dbo.Places.place_phone,
dbo.Places.place_lat,
dbo.Places.place_lng,
CTE.SumVotesForPlace,
CTE.CountVotesForPlace
ORDER BY WeightedIndex DESC, vote_count DESC, place_name ASC
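There is a variable called @AverageNumberOfVotes that is not used in the calculation, but I kept it there for reference in case it is needed.
Running this against my data gives results that differ slightly from before, but it is no revolution, and not quite what I need. Here are the first 10 rows returned when executing the SP above: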
vote_sum vote_count vote_score WeightedIndex
1110 409 2,71393643031785 2,7140960047496
807 310 2,60322580645161 2,60449697749787
38 15 2,53333333333333 2,56708633093525
25 10 2,5 2,55442722744881
2 1 2 2,55188848920863
2 1 2 2,55188848920863
2 1 2 2,55188848920863
2 1 2 2,55188848920863
2 1 2 2,55188848920863
2 1 2 2,55188848920863
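The problem here seems to be that a place with only one vote, with a score of 2, gets a weighted index of 2,55188848920863?
The formula for this index is taken from IMDB (http://www.imdb.com/chart/top), and I think either I am doing something wrong, or the data in my database is not comparable to IMDB's (in the number of votes or the vote scale)?
Is there a way to tune this function so that it works better for me? Is there a better function or method? I still need to do the calculation in the stored procedure.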