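Let's say there are records in the system; here are a few:
I want to find the most commonly used type of record. In the records above, the word "delete" appears a lot. I want to write a query that groups the records by their most popular keyword.
I tried: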
SELECT NAME, COUNT(*)
FROM REQUESTS
GROUP BY NAME
I would like the last column to be named "Most popular keyword".
Answer 0 (score: 0)
This finds the most frequently used keyword in the field:
DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));
INSERT INTO @T
VALUES ('How to delete a record?')
, ('delete a record from system')
, ('user wants to delete products')
, ('delete multiple user accounts');
SELECT TOP (1) D.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(T.Name, ' ') AS D
GROUP BY D.column1
ORDER BY COUNT(*) DESC;
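dbo.GetTableFromList() is a split-string function: it splits a string into rows using the given delimiter. A very detailed article explaining such functions can be found here. In this case, the function splits each Name cell on spaces and produces one row per word.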
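dbo.GetTableFromList is a user-defined function, so it has to be created before the query above will run. As a minimal sketch, assuming SQL Server 2016 or later (database compatibility level 130 or higher), the same split can probably be done with the built-in STRING_SPLIT function, which returns one row per token in a column named value. The filter on empty strings is an addition that is not part of the query above; it skips tokens produced by repeated spaces.

```sql
-- Sketch only: assumes SQL Server 2016+ (compatibility level 130 or higher),
-- where the built-in STRING_SPLIT can stand in for dbo.GetTableFromList.
DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));

INSERT INTO @T (Name)
VALUES ('How to delete a record?')
     , ('delete a record from system')
     , ('user wants to delete products')
     , ('delete multiple user accounts');

-- STRING_SPLIT returns one row per token in a column named "value".
SELECT TOP (1) S.value AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY STRING_SPLIT(T.Name, ' ') AS S
WHERE S.value <> ''            -- skip empty tokens caused by repeated spaces
GROUP BY S.value
ORDER BY COUNT(*) DESC;
```

With the four sample rows, delete occurs once in every row, so it should come back as the top keyword.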
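It would be even better to combine the split with a CLR RegExReplace function that strips all non-alphanumeric characters, so the results are consistent: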
SELECT *
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D;
So, as a first step, you can see what the plain split produces:
SELECT *
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(T.Name, ' ') AS D;
ID  Name                           column1
--  -----------------------------  --------
1   How to delete a record?        How
1   How to delete a record?        to
1   How to delete a record?        delete
1   How to delete a record?        a
1   How to delete a record?        record?
2   delete a record from system    delete
2   delete a record from system    a
2   delete a record from system    record
2   delete a record from system    from
2   delete a record from system    system
3   user wants to delete products  user
3   user wants to delete products  wants
3   user wants to delete products  to
3   user wants to delete products  delete
3   user wants to delete products  products
4   delete multiple user accounts  delete
4   delete multiple user accounts  multiple
4   delete multiple user accounts  user
4   delete multiple user accounts  accounts
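The next step is to group by each generated keyword and order by its count, so that the most frequently used keyword comes first; the TOP (1) clause then returns only that row. Without TOP (1), the full grouped result is: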
Keywords  KeywordCount
--------  ------------
delete    4
a         2
to        2
user      2
wants     1
accounts  1
from      1
How       1
multiple  1
products  1
record    1
record?   1
system    1
If you use the CLR RegExReplace, it returns a cleaner list of keywords (note that record and record? are now counted as the same word):
SELECT d.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
GROUP BY D.column1
ORDER BY COUNT(*) DESC;
Keywords  KeywordCount
--------  ------------
delete    4
a         2
record    2
to        2
user      2
wants     1
system    1
accounts  1
from      1
How       1
multiple  1
products  1
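If a CLR RegExReplace function is not available, a rough substitute can be sketched with built-in functions only. This assumes SQL Server 2017 or later for TRANSLATE and 2016 or later for STRING_SPLIT. Unlike the regular expression '[^A-Za-z0-9 ]', the punctuation list here is a short illustrative one and would need extending for real data.

```sql
-- Sketch only: assumes SQL Server 2017+ (TRANSLATE) and 2016+ (STRING_SPLIT).
-- The punctuation list '?.,;:!' is illustrative, not a full regex replacement.
DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));

INSERT INTO @T (Name)
VALUES ('How to delete a record?')
     , ('delete a record from system')
     , ('user wants to delete products')
     , ('delete multiple user accounts');

SELECT S.value AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY (
    -- Map each listed punctuation character to '|' (both lists are 6 characters),
    -- then turn every '|' into a space so it acts as a word delimiter.
    SELECT REPLACE(TRANSLATE(T.Name, '?.,;:!', '||||||'), '|', ' ') AS CleanName
) AS C
CROSS APPLY STRING_SPLIT(C.CleanName, ' ') AS S
WHERE S.value <> ''            -- the trailing '?' leaves an empty token behind
GROUP BY S.value
ORDER BY COUNT(*) DESC;
```

The two-step TRANSLATE and REPLACE is used because TRANSLATE requires both character lists to have the same length; a text containing a literal '|' would need a different placeholder character.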
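On top of that, there may be many so-called stop words (noise words). One possible way to remove them is to filter against sys.fulltext_stopwords with language_id = 1033 (English). Note that sys.fulltext_stopwords lists the stop words of stoplists created in the current database; the built-in system stoplist is exposed separately through sys.fulltext_system_stopwords, so that view may be the one to query if the database has no custom stoplist. The code is: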
SELECT d.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
WHERE NOT EXISTS (
SELECT 1
FROM sys.fulltext_stopwords AS FS
WHERE FS.stopword = CAST(d.column1 AS NVARCHAR(MAX)) COLLATE SQL_Latin1_General_CP1_CI_AS
AND FS.language_id = 1033
)
GROUP BY D.column1
ORDER BY COUNT(*) DESC;
RESULT
Keywords  KeywordCount
--------  ------------
delete    4
record    2
user      2
wants     1
accounts  1
system    1
multiple  1
products  1
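To see which words the filter can remove, it may help to inspect the stop word list directly. The sketch below reads the built-in system stoplist through sys.fulltext_system_stopwords, which is an assumption about where the stop words live on a given server; a database with its own stoplist would query sys.fulltext_stopwords as in the answer. The four test words are sample values only.

```sql
-- Sketch only: lists the English (language_id 1033) entries of the built-in
-- system stoplist. Requires Full-Text Search to be installed.
SELECT FS.stopword
FROM sys.fulltext_system_stopwords AS FS
WHERE FS.language_id = 1033
ORDER BY FS.stopword;

-- Check a few sample words against that list (1 = treated as a stop word).
SELECT W.word,
       CASE WHEN EXISTS (
                SELECT 1
                FROM sys.fulltext_system_stopwords AS FS
                WHERE FS.stopword = W.word COLLATE SQL_Latin1_General_CP1_CI_AS
                  AND FS.language_id = 1033
            ) THEN 1 ELSE 0 END AS IsStopword
FROM (VALUES (N'a'), (N'to'), (N'delete'), (N'record')) AS W(word);
```

The explicit COLLATE clause follows the answer's query and avoids a collation conflict between the system view and the compared value.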
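Extra update
This query also tries to reduce each word to its base form, so that, for example, records is recognized as record.
Please test it: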
DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));
INSERT INTO @T
VALUES ('How to delete a record?')
, ('delete a record from system')
, ('user wants to delete products')
, ('delete multiple user accounts')
, ('how does someone delete multiple records');
SELECT S.display_term AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY (
SELECT CAST(D.column1 AS NVARCHAR(MAX)) COLLATE SQL_Latin1_General_CP1_CI_AS
FROM dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
) AS D(Phrase)
CROSS APPLY (
SELECT TOP (1) S.display_term
FROM sys.dm_fts_parser('FORMSOF(FREETEXT, "' + D.Phrase + '")', 1033, 0, 0) AS S
WHERE S.source_term = D.Phrase
ORDER BY S.keyword
) AS S
WHERE NOT EXISTS (
SELECT 1
FROM sys.fulltext_stopwords AS FS
WHERE FS.stopword = D.Phrase
AND FS.language_id = 1033
)
GROUP BY S.display_term
ORDER BY COUNT(*) DESC;
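To understand what the inner CROSS APPLY is choosing from, it can help to run sys.dm_fts_parser on a single word. The sketch below uses the documented FORMSOF(INFLECTIONAL, ...) form rather than the FORMSOF(FREETEXT, ...) form used in the query above, so the rows may differ somewhat. According to the documentation, the function requires Full-Text Search to be installed and membership in the sysadmin fixed server role.

```sql
-- Sketch only: shows the forms the English (LCID 1033) word breaker and stemmer
-- generate for one word. INFLECTIONAL is swapped in for the answer's FREETEXT.
SELECT P.keyword,
       P.display_term,   -- human-readable form of the generated term
       P.source_term,    -- the word the term was derived from
       P.occurrence
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "records")', 1033, 0, 0) AS P
ORDER BY P.keyword;
```

The query in the answer takes the first display_term in keyword order for each source word, so whichever form sorts first here is what that word would be grouped under.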
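Answer 1 (score: 0)
The following solution is meant for small tables; performance will drop quickly as the two tables grow.
As you can see, I used a Keywords table and populated it manually. This avoids noise words such as "and", which would otherwise show up as a result of splitting.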
/*
Create Table Requests (id int identity(1,1), Request nvarchar(max))
insert into Requests select 'How to delete a record?'
insert into Requests select 'delete a record from system'
insert into Requests select 'user wants to delete products'
insert into Requests select 'delete multiple user accounts'
*/
--Create Table Keywords (keyword nvarchar(100))
--insert into Keywords values ('delete'),('user'),('how'),('account'),('system'),('product')
select k.keyword, count(k.keyword) as cnt
from Requests r
cross apply Keywords k
where r.Request like ('%' + k.keyword + '%')
group by k.keyword
A better approach than the solution above is to use SQL Server full-text search.
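As a rough sketch of what the full-text route could look like, the script below builds a full-text index on the Requests table from the commented-out script and then reads the indexed terms back with sys.dm_fts_index_keywords. It assumes Full-Text Search is installed and that the table lives in the dbo schema; the names UX_Requests_id and RequestsCatalog are hypothetical.

```sql
-- Sketch only: assumes dbo.Requests (id, Request) exists and Full-Text Search
-- is installed. UX_Requests_id and RequestsCatalog are hypothetical names.
CREATE UNIQUE INDEX UX_Requests_id ON dbo.Requests (id);   -- key index for full-text
CREATE FULLTEXT CATALOG RequestsCatalog;
CREATE FULLTEXT INDEX ON dbo.Requests (Request LANGUAGE 1033)
    KEY INDEX UX_Requests_id
    ON RequestsCatalog;
GO  -- batch separator understood by SSMS and sqlcmd

-- Population runs in the background, so this may be empty right after creation.
SELECT K.display_term, K.document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.Requests')) AS K
WHERE K.display_term <> 'END OF FILE'     -- marker row, not a real keyword
ORDER BY K.document_count DESC;
```

Note that document_count is the number of rows containing the term, not the total number of occurrences, and that stop words are left out of the index by the default stoplist.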
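Answer 2 (score: 0)
If you are able to use SQL Server Semantic Search for this task, it will give you the most frequently used key phrases or keywords in the table.
See the SQL Server semantic search tutorial to enable full-text search and semantic search on SQL Server, create a full-text index, and query text data for important key phrases using the semantic search function SemanticKeyPhraseTable.
I have also included a SELECT query that can be used for this requirement: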
;WITH tbl as (
SELECT * FROM SemanticKeyPhraseTable(Requests, Request)
INNER JOIN Requests ON document_key = id
)
SELECT TOP 3 keyphrase, COUNT(*)
FROM tbl
GROUP BY keyphrase
ORDER BY COUNT(*) desc
I strongly recommend integrating full-text indexing into your database solution and configuring semantic search for text analysis tasks.
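Before the SemanticKeyPhraseTable query can return anything, the column has to be indexed with statistical semantics. The following is a minimal sketch of that setup, assuming the Requests table is in the dbo schema and has no full-text index yet; the names UX_Requests_id and SemanticCatalog are hypothetical.

```sql
-- Sketch only. UX_Requests_id and SemanticCatalog are hypothetical names.

-- Prerequisite checks: is Full-Text Search installed (1 = yes), and is a
-- semantic language statistics database registered (no rows = not yet)?
SELECT SERVERPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled;
SELECT * FROM sys.fulltext_semantic_language_statistics_database;

-- If dbo.Requests has no full-text index yet, create one with semantics enabled.
CREATE UNIQUE INDEX UX_Requests_id ON dbo.Requests (id);
CREATE FULLTEXT CATALOG SemanticCatalog;
CREATE FULLTEXT INDEX ON dbo.Requests (Request LANGUAGE 1033 STATISTICAL_SEMANTICS)
    KEY INDEX UX_Requests_id
    ON SemanticCatalog;

-- If a full-text index already exists on the table, enable semantics on the
-- existing column instead:
-- ALTER FULLTEXT INDEX ON dbo.Requests ALTER COLUMN Request ADD STATISTICAL_SEMANTICS;
```

Index population is asynchronous, so key phrases may only appear some time after the index is created.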