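Help identifying forum spammers with a SQL query?

Time: 2010-11-15 17:34:56

Tags: sql-server tsql

I'd like a simple query that I can run against our database to return anomalies in how often users post to our forum within a time threshold. Suppose I have the following database structure: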

ThreadId | UserId | PostAuthor | PostDate |
1          1000     Spammer      2010-11-14 02:52:50.093
2          1000     Spammer      2010-11-14 02:53:06.893
3          1000     Spammer      2010-11-14 02:53:22.130
4          1000     Spammer      2010-11-14 02:53:37.073
5          2000     RealUser     2010-11-14 02:53:52.383
6          1000     Spammer      2010-11-14 02:54:07.430 

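I'd like to set a threshold: for example, if 3 posts from the same user fall within a 1-minute period, the poster is probably spamming the forum. In turn, I'd like the query to return the user 'Spammer' along with the number of posts made within the specified time.

In the example above, Spammer posted 4 messages within a 1-minute period, so the query result might look like this: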

UserId | PostAuthor | PostCount | DateStart               | DateEnd
1000     Spammer      4           2010-11-14 02:52:50.093   2010-11-14 02:53:37.073

Any suggestions on the format of the returned data are welcome. The format matters less than correctly identifying forum abusers.

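5 Answers:

Answer 0 (score: 1)

This doesn't give you everything you want in the output, but it's a start:

(In words: give me every post that is followed by 2 or more other posts from the same user within one minute.)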

Select 
  Spammer = PostAuthor,
  -- NumberOfPosts is the user's total post count, not the count inside the one-minute window
  NumberOfPosts = (Select Count(*) 
                   From Posts As AllPosts 
                   Where AllPosts.UserID = Posts.UserID)
From Posts
Where 2 <= (Select Count(*)
            From Posts As OtherPosts
            Where OtherPosts.UserID = Posts.UserID
              And OtherPosts.PostDate > Posts.PostDate
              And OtherPosts.PostDate < DateAdd(Minute, 1, Posts.PostDate))

Answer 1 (score: 1)

A self-join solution:

Select T1.UserId, T1.PostAuthor, T1.PostDate, Max(T2.PostDate), Count(*)
from
  Posts T1 INNER JOIN Posts T2 
  ON T1.UserId = T2.UserId and 
     T2.PostDate between T1.PostDate and dateadd(minute, 1, T1.PostDate)
group by T1.UserId, T1.PostAuthor, T1.PostDate
having count(*) >= 3
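
The self-join above returns one row per qualifying starting post, so a busy spammer appears several times. If a single row per user is preferred, as in the question's desired output, one possible sketch is to rank each user's windows and keep the largest. The names Bursts, Ranked and BurstRank are introduced here for illustration only, and the sketch assumes a user never has two posts with an identical PostDate.

```sql
;WITH Bursts AS
(
    -- Same self-join as above, with the output columns named as in the question
    SELECT
        T1.UserId,
        T1.PostAuthor,
        COUNT(*)         AS PostCount,
        T1.PostDate      AS DateStart,
        MAX(T2.PostDate) AS DateEnd
    FROM Posts AS T1
    INNER JOIN Posts AS T2
        ON  T1.UserId = T2.UserId
        AND T2.PostDate BETWEEN T1.PostDate AND DATEADD(MINUTE, 1, T1.PostDate)
    GROUP BY T1.UserId, T1.PostAuthor, T1.PostDate
    HAVING COUNT(*) >= 3
),
Ranked AS
(
    -- Rank each user's windows: most posts first, earliest start as tie-breaker
    SELECT
        UserId, PostAuthor, PostCount, DateStart, DateEnd,
        ROW_NUMBER() OVER (PARTITION BY UserId
                           ORDER BY PostCount DESC, DateStart ASC) AS BurstRank
    FROM Bursts
)
SELECT UserId, PostAuthor, PostCount, DateStart, DateEnd
FROM Ranked
WHERE BurstRank = 1;
```

Against the sample rows in the question this should return a single row for Spammer: 4 posts from 2010-11-14 02:52:50.093 to 2010-11-14 02:53:37.073.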

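Answer 2 (score: 0)

I had a go at this and came up with the following (I think it gives much the same result as Stu's, though without the post count). It identifies users who made 3 posts within 1 minute (so a user who made 5 posts in that time is repeated 3 times).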

DECLARE @threshold INT;
SET @threshold = 3;

;WITH postCTE as
(
SELECT 
  Userid,
  PostAuthor,
  PostDate,
  RowNumber = ROW_NUMBER() OVER (PARTITION by UserId ORDER BY PostDate ASC)
FROM Posts
)
SELECT 
  p1.UserId, 
  p1.PostAuthor, 
  p1.PostDate AS StartTime, 
  p2.PostDate AS EndTime
FROM postCTE p1
   JOIN postCTE p2 
     ON p1.UserId = p2.UserId 
     AND p1.Rownumber = p2.RowNumber - (@threshold - 1)
WHERE DATEDIFF(MINUTE,p1.PostDate,p2.PostDate) <= 1

It returns the following result set:

UserId  PostAuthor  StartTime                EndTime
1000    Spammer     2010-11-14 02:52:50.093  2010-11-14 02:53:22.130
1000    Spammer     2010-11-14 02:53:06.893  2010-11-14 02:53:37.073
1000    Spammer     2010-11-14 02:53:22.130  2010-11-14 02:54:07.430
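
One caveat about the WHERE clause in this query: DATEDIFF counts the datepart boundaries crossed between two values, not the elapsed time, so DATEDIFF(MINUTE, ...) <= 1 can accept gaps of almost two minutes. The statement below is only an illustration with made-up timestamps (they are not rows from the question's table):

```sql
SELECT
    -- 2 seconds apart, but one minute boundary is crossed: returns 1
    DATEDIFF(MINUTE, CAST('2010-11-14 02:53:59' AS DATETIME),
                     CAST('2010-11-14 02:54:01' AS DATETIME)) AS ShortGapInMinutes,
    -- 119 seconds apart, still only one minute boundary crossed: returns 1
    DATEDIFF(MINUTE, CAST('2010-11-14 02:53:00' AS DATETIME),
                     CAST('2010-11-14 02:54:59' AS DATETIME)) AS LongGapInMinutes,
    -- The same long gap measured in seconds: returns 119
    DATEDIFF(SECOND, CAST('2010-11-14 02:53:00' AS DATETIME),
                     CAST('2010-11-14 02:54:59' AS DATETIME)) AS LongGapInSeconds;
```

Comparing in seconds, or comparing the end time against DATEADD(MINUTE, 1, start time), avoids treating the 119-second gap as "within one minute".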

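Answer 3 (score: 0)

I believe Sadhir is on the right track. I have a few corrections to the script. The first concerns using 'minute' as the DATEDIFF unit: with minutes, the query does not correctly return the four records in George's example, so I changed 'minute' to 'second'. I also formatted the output to show the number of posts recorded within one minute, computed from the difference between the row numbers in the CTE. Although George didn't ask for it, I added a parameter that controls how many days back to look in the table, because I doubt anyone wants to scan the whole table every time.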

DECLARE @threshold INT; 
SET @threshold = 3; 
DECLARE @lookbackdays int;
SET @lookbackdays = 2;

;WITH postCTE as 
( 
SELECT  
    Userid, 
    PostAuthor, 
    PostDate, 
    RowNumber = ROW_NUMBER() OVER (ORDER BY UserId,PostDate ASC) 
FROM 
    Post2Forum 
WHERE 
    PostDate > GETDATE() - @lookbackdays
) 
SELECT  
    p1.PostAuthor AS [PostAuthor],  
    p2.RowNumber - p1.RowNumber +1 AS [PostCount],
    p1.UserId,  
    p1.PostDate AS [DateStart],  
    p2.PostDate AS [DateEnd] 
FROM 
    postCTE p1 
INNER JOIN 
    postCTE p2  
    ON p1.UserId = p2.UserId  
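    -- An offset of @threshold pairs rows that are @threshold apart, so each match spans @threshold + 1 posts
    AND p1.Rownumber = p2.RowNumber - (@threshold ) 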
WHERE 
    DATEDIFF(second,p1.PostDate,p2.PostDate) <= 60

The query results from my test were:

PostAuthor | PostCount | UserId | DateStart               | DateEnd
Spammer      4           1000     2010-11-14 02:52:50.093   2010-11-14 02:53:37.073
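
On SQL Server 2012 or later, the self-join on row numbers can probably be replaced with LEAD. The following is only a sketch under assumptions: the table is called Posts as in the earlier answers, and @windowSeconds and the CTE name Ahead are introduced here for illustration. It uses an offset of @threshold - 1 so that each match spans exactly @threshold posts, and it compares against DATEADD so that elapsed time is checked rather than datepart boundaries.

```sql
DECLARE @threshold INT = 3;        -- number of posts that make up a burst
DECLARE @windowSeconds INT = 60;   -- hypothetical parameter: length of the window

WITH Ahead AS
(
    SELECT
        UserId,
        PostAuthor,
        PostDate AS DateStart,
        -- Date of the post that is (@threshold - 1) posts later for the same user;
        -- NULL when the user has too few later posts
        LEAD(PostDate, @threshold - 1)
            OVER (PARTITION BY UserId ORDER BY PostDate) AS DateEnd
    FROM Posts
)
SELECT UserId, PostAuthor, DateStart, DateEnd
FROM Ahead
-- Rows with a NULL DateEnd drop out because the comparison is not true
WHERE DateEnd <= DATEADD(SECOND, @windowSeconds, DateStart);
```

With @threshold = 3, the question's sample rows should yield three overlapping windows for Spammer, starting at 02:52:50.093, 02:53:06.893 and 02:53:22.130.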

Answer 4 (score: -1)

Not exactly what you want, but it will more or less serve the purpose...

SELECT 
  UserId, 
  PostAuthor, 
  COUNT(*) AS [PostCount],
  YEAR(PostDate), 
  MONTH(PostDate), 
  DAY(PostDate), 
  DATEPART(hh, PostDate), 
  DATEPART(mi, PostDate)
FROM LogTable
GROUP BY 
  UserId, 
  PostAuthor, 
  YEAR(PostDate), 
  MONTH(PostDate), 
  DAY(PostDate), 
  DATEPART(hh, PostDate), 
  DATEPART(mi, PostDate)
HAVING COUNT(*) >= 3
ORDER BY 
  YEAR(PostDate), 
  MONTH(PostDate), 
  DAY(PostDate), 
  DATEPART(hh, PostDate), 
  DATEPART(mi, PostDate)
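
Because this query groups by calendar minute, a burst that straddles a minute boundary is split across two buckets and can slip under the threshold. The small demo below uses hypothetical data (the DemoPosts rows are invented for illustration, not taken from the question) to show three posts made within 3 seconds landing in two different buckets:

```sql
;WITH DemoPosts (UserId, PostDate) AS
(
    -- Hypothetical rows: 3 posts in 3 seconds, straddling 02:53:00
    SELECT 1000, CAST('2010-11-14 02:52:58' AS DATETIME) UNION ALL
    SELECT 1000, CAST('2010-11-14 02:52:59' AS DATETIME) UNION ALL
    SELECT 1000, CAST('2010-11-14 02:53:01' AS DATETIME)
)
SELECT
    UserId,
    DATEPART(mi, PostDate) AS MinuteBucket,
    COUNT(*)               AS PostCount
FROM DemoPosts
GROUP BY UserId, DATEPART(mi, PostDate);
-- Minute 52 holds 2 posts and minute 53 holds 1,
-- so a HAVING COUNT(*) >= 3 filter would return no rows for this user.
```

A sliding window measured from each post, as in the self-join and CTE answers, does not have this blind spot.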