Word Popularity Leaderboard From a Message Board in SQL Server

Time: 2018-02-09 15:21:12

Tags: sql-server split group-by sum alphanumeric

In a SQL Server database I have a table, Messages, with the following columns:

  • Id INT(1,1)
  • Detail VARCHAR(5000)
  • DatetimeEntered DATETIME
  • PersonEntered VARCHAR(25)

The messages are very basic; only alphanumeric characters and a handful of special characters are allowed, as follows:

`¬!"£$%^&*()-_=+[{]};:'@#~\|,<.>/?

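Ignoring most of the special characters, apart from the apostrophe, what I need is a way to list each word along with the number of times it occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.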

Example output:

Word      Frequency
-----------------
a         11280
the       10102
and       8845
when      2024
don't     2013
.
.
.

It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.

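I'm having trouble splitting the words out into a temporary table called #Words.

Once I have the temporary table, I would apply a query to it to produce the counts.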
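
A minimal sketch of what such a counting query might look like. It assumes #Words holds one row per word occurrence in a hypothetical column named Word; the sample rows are made up so that the batch runs on its own.

```sql
-- Hypothetical layout for #Words: one row per word occurrence.
CREATE TABLE #Words (Word varchar(50));

-- Made-up sample rows, for illustration only.
INSERT INTO #Words (Word)
VALUES ('the'), ('a'), ('the'), ('don''t'), ('a'), ('the');

-- Count how often each word occurs, most frequent first.
SELECT W.Word,
       COUNT(*) AS Frequency
FROM #Words W
GROUP BY W.Word
ORDER BY Frequency DESC;

DROP TABLE #Words;
```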

Please help.

1 Answer:

Answer 0: (score: 1)

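Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.

You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE (so REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')), and find a string splitter (for example, Jeff Moden's DelimitedSplit8K, http://www.sqlservercentral.com/articles/Tally+Table/72993/).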

USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));

INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
       ('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
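-- Swap every special character except the apostrophe for a space.
-- TRANSLATE (SQL Server 2017+) needs both character lists to be the same length:
-- 33 characters to replace, so 33 spaces.
WITH Replacements AS(
    SELECT TRANSLATE(M.Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail
    FROM [Messages] M)
-- Split on spaces and count each distinct word.
-- Consecutive spaces produce empty strings, which the WHERE clause skips.
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
     CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;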
GO
DROP TABLE [Messages];
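
The query above counts words across every message. The question also asks to filter by PersonEntered and DatetimeEntered; a minimal sketch of how that might be added follows. The table variable @Messages, its rows, and the filter values are all made up for illustration, and the column types are taken from the question's description.

```sql
-- Stand-in for the question's Messages table; the rows are made up.
DECLARE @Messages TABLE (
    Id int IDENTITY(1,1),
    Detail varchar(5000),
    DatetimeEntered datetime,
    PersonEntered varchar(25));

INSERT INTO @Messages (Detail, DatetimeEntered, PersonEntered)
VALUES ('Don''t panic, it''s fine.', '20180105', 'Alice'),
       ('It''s fine; don''t worry!', '20180110', 'Alice'),
       ('Something else entirely.', '20180110', 'Bob');

-- Hypothetical filter values.
DECLARE @Person varchar(25) = 'Alice';
DECLARE @From datetime = '20180101';
DECLARE @To datetime = '20180201';

WITH Replacements AS(
    SELECT TRANSLATE(M.Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail
    FROM @Messages M
    WHERE M.PersonEntered = @Person
      AND M.DatetimeEntered >= @From   -- half-open range: on or after @From
      AND M.DatetimeEntered < @To)     -- and strictly before @To
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
     CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
```

Filtering inside the CTE means only the matching messages are stripped and split, which matters given how expensive the split is.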

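As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers in there. Things like dates are going to get split out, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.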
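
For a version without TRANSLATE, the nested REPLACE approach mentioned earlier might look roughly like the sketch below. It is deliberately incomplete: only four of the permitted special characters are handled, and the rest would each need their own REPLACE. The sample string is made up. STRING_SPLIT itself needs SQL Server 2016 (compatibility level 130) or later; on anything older, a splitter function such as DelimitedSplit8K would have to be installed and used in its place.

```sql
-- Made-up sample text.
DECLARE @Detail varchar(5000) = 'Hello, world! Don''t stop; hello?';

-- One REPLACE per special character; only , ! ; ? are covered here.
SELECT SS.[value], COUNT(*) AS WordCount
FROM (SELECT REPLACE(REPLACE(REPLACE(REPLACE(@Detail,
                 ',', ' '),
                 '!', ' '),
                 ';', ' '),
                 '?', ' ') AS StrippedDetail) R
     CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
```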