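I would like a simple query that I can run against our database to return anomalies in how quickly users post to our forum. Suppose I have the following database structure: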
ThreadId | UserId | PostAuthor | PostDate |
1 1000 Spammer 2010-11-14 02:52:50.093
2 1000 Spammer 2010-11-14 02:53:06.893
3 1000 Spammer 2010-11-14 02:53:22.130
4 1000 Spammer 2010-11-14 02:53:37.073
5 2000 RealUser 2010-11-14 02:53:52.383
6 1000 Spammer 2010-11-14 02:54:07.430
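I would like to set a threshold so that, for example, if 3 posts from the same user fall within a 1-minute period, the poster is probably spamming the forum. In turn, I would like the query to return the user 'Spammer' together with the number of posts made within the specified time frame.
In the example above, Spammer posted 4 messages within a 1-minute period, so the query result might look like this: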
UserId | PostAuthor | PostCount | DateStart | DateEnd
1000 Spammer 4 2010-11-14 02:52:50.093 2010-11-14 02:53:37.073
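Any suggestions on the format of the returned data are welcome. The format matters less than correctly identifying forum abusers.
Answer 0 (score: 1)
This does not have everything you want in the output, but it is a start:
(Reworded: give me every post that is followed by 2 or more other posts from the same user within one minute.)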
Select
Spammer = PostAuthor,
NumberOfPosts = (Select Count(*)
From Posts As AllPosts
Where AllPosts.UserID = Posts.UserID)
From Posts
Where 2 <= (Select Count(*)
From Posts As OtherPosts
Where OtherPosts.UserID = Posts.UserID
And OtherPosts.PostDate > Posts.PostDate
And OtherPosts.PostDate < DateAdd(Minute, 1, Posts.PostDate))
Answer 1 (score: 1)
A self-join solution:
Select T1.UserId, T1.PostAuthor, T1.PostDate, Max(T2.PostDate), Count(*)
from
Posts T1 INNER JOIN Posts T2
ON T1.UserId = T2.UserId and
T2.PostDate between T1.PostDate and dateadd(minute, 1, T1.PostDate)
group by T1.UserId, T1.PostAuthor, T1.PostDate
having count(*) >= 3
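To check what the self-join should return for the sample data, here is a rough Python sketch of the same logic. It is only an emulation, not the query itself: the `bursts` function and the `POSTS` list are names I made up for illustration, and the rows are copied from the question.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (UserId, PostAuthor, PostDate) rows copied from the question's sample data.
POSTS = [
    (1000, "Spammer", "2010-11-14 02:52:50.093"),
    (1000, "Spammer", "2010-11-14 02:53:06.893"),
    (1000, "Spammer", "2010-11-14 02:53:22.130"),
    (1000, "Spammer", "2010-11-14 02:53:37.073"),
    (2000, "RealUser", "2010-11-14 02:53:52.383"),
    (1000, "Spammer", "2010-11-14 02:54:07.430"),
]


def bursts(posts, threshold=3, window=timedelta(minutes=1)):
    """For each post, count the same user's posts in [PostDate, PostDate + window]."""
    parsed = [(u, a, datetime.strptime(d, FMT)) for u, a, d in posts]
    found = []
    for user, author, start in parsed:
        # Mirrors: T2.PostDate BETWEEN T1.PostDate AND DATEADD(minute, 1, T1.PostDate)
        in_window = [d for u, _, d in parsed
                     if u == user and start <= d <= start + window]
        # Mirrors: HAVING COUNT(*) >= 3
        if len(in_window) >= threshold:
            found.append((user, author, len(in_window), start, max(in_window)))
    return found


for row in bursts(POSTS):
    print(row)
```

With the sample data this yields one row per qualifying starting post, so the same burst shows up more than once. If you want a single row per burst you would still need to collapse the overlapping windows.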
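Answer 2 (score: 0)
I was having a go at this and came up with the following (I guess it gives much the same result as Stu's, apart from the post count). It identifies users who have 3 posts within 1 minute (so if there are 5 posts, the user is repeated 3 times).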
DECLARE @threshold INT;
SET @threshold = 3;
;WITH postCTE as
(
SELECT
Userid,
PostAuthor,
PostDate,
RowNumber = ROW_NUMBER() OVER (PARTITION by UserId ORDER BY PostDate ASC)
FROM Posts
)
SELECT
p1.UserId,
p1.PostAuthor,
p1.PostDate AS StartTime,
p2.PostDate AS EndTime
FROM postCTE p1
JOIN postCTE p2
ON p1.UserId = p2.UserId
AND p1.Rownumber = p2.RowNumber - (@threshold - 1)
WHERE DATEDIFF(MINUTE,p1.PostDate,p2.PostDate) <= 1
This returns the following result set:
UserId PostAuthor StartTime EndTime
1000 Spammer 2010-11-14 02:52:50.093 2010-11-14 02:53:22.130
1000 Spammer 2010-11-14 02:53:06.893 2010-11-14 02:53:37.073
1000 Spammer 2010-11-14 02:53:22.130 2010-11-14 02:54:07.430
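The row-number trick can be sketched in Python as follows. This is an illustration under my own assumptions: `offset_pairs` and `POSTS` are hypothetical names, and the comparison uses true elapsed time rather than `DATEDIFF(MINUTE, ...)`, which happens to give the same rows for this sample.

```python
from datetime import datetime, timedelta
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (UserId, PostAuthor, PostDate) rows copied from the question's sample data.
POSTS = [
    (1000, "Spammer", "2010-11-14 02:52:50.093"),
    (1000, "Spammer", "2010-11-14 02:53:06.893"),
    (1000, "Spammer", "2010-11-14 02:53:22.130"),
    (1000, "Spammer", "2010-11-14 02:53:37.073"),
    (2000, "RealUser", "2010-11-14 02:53:52.383"),
    (1000, "Spammer", "2010-11-14 02:54:07.430"),
]


def offset_pairs(posts, threshold=3, window=timedelta(minutes=1)):
    """Pair each post with the same user's post (threshold - 1) rows later."""
    # Sorting by (UserId, PostAuthor, PostDate) plays the role of
    # ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY PostDate).
    parsed = sorted((u, a, datetime.strptime(d, FMT)) for u, a, d in posts)
    pairs = []
    for (user, author), group in groupby(parsed, key=lambda r: (r[0], r[1])):
        dates = [r[2] for r in group]
        for i in range(len(dates) - (threshold - 1)):
            start, end = dates[i], dates[i + threshold - 1]
            # If the Nth post after this one is inside the window,
            # every post in between is inside it as well.
            if end - start <= window:
                pairs.append((user, author, start, end))
    return pairs


for row in offset_pairs(POSTS):
    print(row)
```

The appeal of this approach is that each post is compared with exactly one other post, instead of with every other post by the same user.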
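Answer 3 (score: 0)
I believe Sadhir is on the right track. I have a few corrections to the script. The first concerns the 'minute' unit used with DATEDIFF. Using minutes will not correctly return the four records in George's example, so I changed 'minute' to 'second'. I also formatted the output to show the number of posts recorded within one minute, by calculating the difference between the row numbers in the CTE. Although George did not ask for it, I added a parameter to control how many days back the query looks, since I doubt anyone would want to scan the whole table every time.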
DECLARE @threshold INT;
SET @threshold = 3;
DECLARE @lookbackdays int;
SET @lookbackdays = 2;
;WITH postCTE as
(
SELECT
Userid,
PostAuthor,
PostDate,
RowNumber = ROW_NUMBER() OVER (ORDER BY UserId,PostDate ASC)
FROM
Post2Forum
WHERE
PostDate > GETDATE() - @lookbackdays
)
SELECT
p1.PostAuthor AS [PostAuthor],
p2.RowNumber - p1.RowNumber +1 AS [PostCount],
p1.UserId,
p1.PostDate AS [DateStart],
p2.PostDate AS [DateEnd]
FROM
postCTE p1
INNER JOIN
postCTE p2
ON p1.UserId = p2.UserId
AND p1.Rownumber = p2.RowNumber - (@threshold ) -- pairs each post with the one @threshold rows later, so a match spans @threshold + 1 posts
WHERE
DATEDIFF(second,p1.PostDate,p2.PostDate) <= 60
The query result in my test was:
PostAuthor PostCount UserId DateStart DateEnd
Spammer 4 1000 2010-11-14 02:52:50.093 2010-11-14 02:53:37.073
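The reason for switching units is that DATEDIFF counts the number of datepart boundaries crossed, not elapsed time. Here is a small Python emulation of that behaviour; `datediff_minute` and `datediff_second` are hypothetical helpers written for this illustration, not SQL Server functions.

```python
from datetime import datetime


def datediff_minute(start, end):
    """Count minute boundaries crossed, emulating DATEDIFF(minute, start, end)."""
    s = start.replace(second=0, microsecond=0)
    e = end.replace(second=0, microsecond=0)
    return int((e - s).total_seconds() // 60)


def datediff_second(start, end):
    """Count second boundaries crossed, emulating DATEDIFF(second, start, end)."""
    s = start.replace(microsecond=0)
    e = end.replace(microsecond=0)
    return int((e - s).total_seconds())


# Second and fifth Spammer posts from the question: just over a minute apart.
a = datetime(2010, 11, 14, 2, 53, 6, 893000)
b = datetime(2010, 11, 14, 2, 54, 7, 430000)
print(datediff_minute(a, b))  # one minute boundary crossed
print(datediff_second(a, b))  # 61 second boundaries crossed

# Worst case for the minute unit: almost two minutes apart, still one boundary.
c = datetime(2010, 11, 14, 2, 53, 0)
d = datetime(2010, 11, 14, 2, 54, 59)
print(datediff_minute(c, d), datediff_second(c, d))
```

So a test of `DATEDIFF(minute, ...) <= 1` can accept posts that are nearly two minutes apart, whereas `DATEDIFF(second, ...) <= 60` stays close to the intended one-minute window.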
Answer 4 (score: -1)
Not exactly what you want, but it will more or less serve the purpose...
SELECT
UserId,
PostAuthor,
COUNT(*) AS [PostCount],
YEAR(PostDate),
MONTH(PostDate),
DAY(PostDate),
DATEPART(hh, PostDate),
DATEPART(mi, PostDate)
FROM LogTable
GROUP BY
UserId,
PostAuthor,
YEAR(PostDate),
MONTH(PostDate),
DAY(PostDate),
DATEPART(hh, PostDate),
DATEPART(mi, PostDate)
HAVING COUNT(*) >= 3
ORDER BY
YEAR(PostDate),
MONTH(PostDate),
DAY(PostDate),
DATEPART(hh, PostDate),
DATEPART(mi, PostDate)
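To see where the "more or less" comes from, here is a rough Python sketch of grouping by calendar minute. The names `minute_buckets`, `SAMPLE` and `STRADDLING` are made up for this illustration, and the second data set is invented to show the edge case; it is not from the question.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def minute_buckets(posts, threshold=3):
    """Count posts per (UserId, calendar minute), like GROUP BY ... DATEPART(mi, PostDate)."""
    counts = Counter()
    for user, stamp in posts:
        minute = datetime.strptime(stamp, FMT).replace(second=0, microsecond=0)
        counts[(user, minute)] += 1
    # Mirrors: HAVING COUNT(*) >= 3
    return {key: n for key, n in counts.items() if n >= threshold}


# (UserId, PostDate) pairs copied from the question's sample data.
SAMPLE = [
    (1000, "2010-11-14 02:52:50.093"),
    (1000, "2010-11-14 02:53:06.893"),
    (1000, "2010-11-14 02:53:22.130"),
    (1000, "2010-11-14 02:53:37.073"),
    (2000, "2010-11-14 02:53:52.383"),
    (1000, "2010-11-14 02:54:07.430"),
]

# Invented data: three posts in 15 seconds that straddle a minute boundary.
STRADDLING = [
    (1000, "2010-11-14 02:52:50.000"),
    (1000, "2010-11-14 02:52:58.000"),
    (1000, "2010-11-14 02:53:05.000"),
]

print(minute_buckets(SAMPLE))      # one bucket for user 1000 with 3 posts
print(minute_buckets(STRADDLING))  # empty: the burst is split across two buckets
```

On the question's data the user is still flagged, but with a count of 3 rather than 4, and a burst that crosses a minute boundary can be missed entirely. That is the trade-off for the simpler query.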