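In a SQL Server database, I have a table `Messages` with the following columns:

- an `INT(1,1)` column
- `Detail`, `VARCHAR(5000)`
- `DatetimeEntered`, `DATETIME`
- `PersonEntered`, `VARCHAR(25)`

Messages are very basic and allow only alphanumeric characters plus a handful of special characters, which are as follows:

```
`¬!"£$%^&*()-_=+[{]};:'@#~\|,<.>/?
```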
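Ignoring the bulk of the special characters except the apostrophe, what I need is a way to list each word along with the number of times it occurs in the `Detail` column, which I can then filter by `PersonEntered` and `DatetimeEntered`.

Example output:

```
Word    Frequency
-----------------
a       11280
the     10102
and     8845
when    2024
don't   2013
.
.
.
```

It doesn't need to be particularly clever. It is perfectly fine if `dont` and `don't` are treated as separate words.

I'm having trouble splitting the words out into a temporary table called `#Words`.

Once I have the temporary table, I would run my query against it.

Please help.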
Answer 0 (score: 1)
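Personally, I would strip out almost all the special characters and then use a splitter on the space character. Of your permitted characters, only `'` is going to appear inside a word; anything else is going to be grammatical.

You haven't posted which version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax (`TRANSLATE` was introduced in SQL Server 2017, `STRING_SPLIT` in SQL Server 2016). If you don't have the latest version, you'll need to replace `TRANSLATE` with nested `REPLACE` calls (so `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')`) and find a string splitter (for example, Jeff Moden's [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).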
```sql
USE Sandbox;  -- assumes a scratch database named Sandbox exists
GO

CREATE TABLE [Messages] (Detail varchar(5000));

INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
       ('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.');
GO

-- TRANSLATE requires its 2nd and 3rd arguments to be the same length,
-- so build one space per stripped character. The apostrophe is deliberately kept.
DECLARE @Strip varchar(40) = '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?';

WITH Replacements AS (
    SELECT TRANSLATE(M.Detail, @Strip, REPLICATE(' ', LEN(@Strip))) AS StrippedDetail
    FROM [Messages] M
)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
    CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0  -- drop the empty strings left by consecutive spaces
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO

DROP TABLE [Messages];
```
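The question also asks for the words in a `#Words` temporary table that can be filtered by `PersonEntered` and `DatetimeEntered`. One way to get there, as a rough sketch, is to carry those two columns through the split. This assumes the full `Messages` table from the question exists (not the single-column demo table); the `Word` column name and the filter values are made up for illustration.

```sql
-- Same character list as above; the apostrophe is kept so "don't" stays one word.
DECLARE @Strip varchar(40) = '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?';

-- One row per word, keeping the columns needed for filtering later.
SELECT M.PersonEntered,
       M.DatetimeEntered,
       SS.[value] AS Word            -- hypothetical column name
INTO #Words
FROM [Messages] M
    CROSS APPLY STRING_SPLIT(
        TRANSLATE(M.Detail, @Strip, REPLICATE(' ', LEN(@Strip))), ' ') SS
WHERE LEN(SS.[value]) > 0;

-- Word frequency for one person within one month.
SELECT W.Word, COUNT(*) AS Frequency
FROM #Words W
WHERE W.PersonEntered = 'SomeUser'         -- hypothetical value
  AND W.DatetimeEntered >= '20180101'      -- hypothetical date range
  AND W.DatetimeEntered <  '20180201'
GROUP BY W.Word
ORDER BY Frequency DESC;

DROP TABLE #Words;
```

Filtering after the split keeps `#Words` reusable for several people or date ranges. If only one filter is ever needed, applying it to `Messages` before splitting would mean far fewer rows to split.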
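As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers in there. Things like dates are going to get split apart, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.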
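For servers without `TRANSLATE`, writing out one nested `REPLACE` per character works but is tedious to maintain. The sketch below swaps in a different technique: a loop that applies `REPLACE` once per character to a working copy of the data. It assumes a `Messages` table with a `Detail` column exists; `#Stripped` is a hypothetical name. The result would still need a splitter such as DelimitedSplit8K on versions that lack `STRING_SPLIT`.

```sql
DECLARE @Strip varchar(40) = '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?';
DECLARE @i int = 1;

-- Work on a copy so the original rows stay untouched.
SELECT M.Detail AS StrippedDetail
INTO #Stripped
FROM [Messages] M;

-- Replace each unwanted character with a space, one character per pass.
WHILE @i <= LEN(@Strip)
BEGIN
    UPDATE #Stripped
    SET StrippedDetail = REPLACE(StrippedDetail, SUBSTRING(@Strip, @i, 1), ' ');

    SET @i += 1;
END;

SELECT StrippedDetail FROM #Stripped;

DROP TABLE #Stripped;
```

Each pass rewrites every row, so this is likely to be slower still than the `TRANSLATE` version. It trades speed for a character list that lives in one place.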