Group by popular keyword

Time: 2015-07-21 09:57:41

Tags: sql sql-server sql-server-2008 tsql group-by

Let's say there are records in the system. Here are a few of them:

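  1. How to delete a record?
  2. delete a record from system
  3. user wants to delete products
  4. delete multiple user accounts

I want to find the most common type of record. In the records above, the word "delete" appears many times. I want to write a query that groups the records by their most popular keyword.

I tried this: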

    SELECT NAME, COUNT(*)
    FROM REQUESTS
    GROUP BY NAME
    

I would like the last column to be named "most popular keyword".

3 answers:

Answer 0: (score: 0)

This finds the keyword that occurs most often in the Name column:

DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));

INSERT INTO @T
VALUES ('How to delete a record?')
    , ('delete a record from system')
    , ('user wants to delete products')
    , ('delete multiple user accounts');

SELECT TOP (1) D.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(T.Name, ' ') AS D
GROUP BY D.column1
ORDER BY COUNT(*) DESC;

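dbo.GetTableFromList() is a split-string function (it splits a string into rows using the given delimiter). A very detailed article explaining such functions can be found here. In this case, the function splits each Name value on spaces and produces one row per word.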
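The body of dbo.GetTableFromList() is not shown in this answer, so here is a minimal sketch of what such a function could look like on SQL Server 2008. The parameter names and the loop-based approach are my assumptions; only the function name and the column1 output column come from the queries above. A loop like this is easy to read but is usually one of the slower ways to split strings, so treat it as an illustration rather than a recommendation.

```sql
-- Hypothetical implementation: the original answer does not show this function.
CREATE FUNCTION dbo.GetTableFromList (@List VARCHAR(MAX), @Delimiter CHAR(1))
RETURNS @Items TABLE (column1 VARCHAR(MAX))
AS
BEGIN
    DECLARE @Start INT = 1, @Pos INT;

    -- Append a trailing delimiter so the last item is terminated like the others.
    SET @List = @List + @Delimiter;
    SET @Pos = CHARINDEX(@Delimiter, @List, @Start);

    WHILE @Pos > 0
    BEGIN
        -- Skip empty items produced by consecutive delimiters.
        IF @Pos > @Start
            INSERT INTO @Items (column1)
            VALUES (SUBSTRING(@List, @Start, @Pos - @Start));

        SET @Start = @Pos + 1;
        SET @Pos = CHARINDEX(@Delimiter, @List, @Start);
    END;

    RETURN;
END;
```

With this sketch in place, SELECT column1 FROM dbo.GetTableFromList('delete a record', ' ') should return three rows, one per word.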

For consistent results, it would be a good idea to strip all non-alphanumeric characters with a CLR RegExReplace function, like this:

SELECT *
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D;
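If deploying a CLR assembly is not an option, the same cleanup can be approximated in plain T-SQL. The function below is a sketch under that assumption; dbo.StripNonAlphanumeric is a hypothetical name, not something from the original answer. It replaces every character outside A-Z, a-z, 0-9 and space with a space, one character at a time. How the character ranges behave depends on the collation in effect, so accented letters may or may not be kept.

```sql
-- Hypothetical pure T-SQL stand-in for dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' ').
CREATE FUNCTION dbo.StripNonAlphanumeric (@Text VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Position of the first character that is not a letter, digit or space (0 if none).
    DECLARE @Pos INT = PATINDEX('%[^A-Za-z0-9 ]%', @Text);

    WHILE @Pos > 0
    BEGIN
        -- Overwrite the offending character with a space, then look for the next one.
        SET @Text = STUFF(@Text, @Pos, 1, ' ');
        SET @Pos = PATINDEX('%[^A-Za-z0-9 ]%', @Text);
    END;

    RETURN @Text;
END;
```

It can then be dropped into the query in place of the CLR call: CROSS APPLY dbo.GetTableFromList(dbo.StripNonAlphanumeric(T.Name), ' ') AS D.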

So, as a first step, you can see what the split produces with this code:

SELECT *
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(T.Name, ' ') AS D;

ID  Name                            column1
-------------------------------------------
1   How to delete a record?         How
1   How to delete a record?         to
1   How to delete a record?         delete
1   How to delete a record?         a
1   How to delete a record?         record?
2   delete a record from system     delete
2   delete a record from system     a
2   delete a record from system     record
2   delete a record from system     from
2   delete a record from system     system
3   user wants to delete products   user
3   user wants to delete products   wants
3   user wants to delete products   to
3   user wants to delete products   delete
3   user wants to delete products   products
4   delete multiple user accounts   delete
4   delete multiple user accounts   multiple
4   delete multiple user accounts   user
4   delete multiple user accounts   accounts

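The next step is to group by each generated keyword and order by its count, so that the most common keyword comes first (the TOP (1) clause then returns only that row). Without TOP (1), the full ranking looks like this: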

Keywords    KeywordCount
------------------------
delete      4
a           2
to          2
user        2
wants       1
accounts    1
from        1
How         1
multiple    1
products    1
record      1
record?     1
system      1

If you use the CLR RegExReplace, you get a cleaner list of keywords (note that record and record? are now counted as the same word):

SELECT d.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
GROUP BY D.column1
ORDER BY COUNT(*) DESC;

Keywords    KeywordCount
------------------------
delete      4
a           2
record      2
to          2
user        2
wants       1
system      1
accounts    1
from        1
How         1
multiple    1
products    1

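On top of that, there will probably be plenty of so-called stop words (noise words). One possible way to remove them is to use sys.fulltext_stopwords with language_id = 1033 (English), as in the code below. Note that sys.fulltext_stopwords only lists stop words belonging to stoplists created in the current database; the built-in list is exposed through sys.fulltext_system_stopwords.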

SELECT d.column1 AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
WHERE NOT EXISTS (
        SELECT 1
        FROM sys.fulltext_stopwords AS FS
        WHERE FS.stopword = CAST(d.column1 AS NVARCHAR(MAX)) COLLATE SQL_Latin1_General_CP1_CI_AS
            AND FS.language_id = 1033
        )
GROUP BY D.column1
ORDER BY COUNT(*) DESC;

RESULT

Keywords    KeywordCount
------------------------
delete      4
record      2
user        2
wants       1
accounts    1
system      1
multiple    1
products    1

Additional update

This query also tries to reduce each word to its base form, e.g. records will be recognized as record.

Please test it:

DECLARE @T TABLE (ID INT IDENTITY, Name VARCHAR(50));

INSERT INTO @T
VALUES ('How to delete a record?')
    , ('delete a record from system')
    , ('user wants to delete products')
    , ('delete multiple user accounts')
    , ('how does someone delete multiple records');

SELECT S.display_term AS Keywords, COUNT(*) AS KeywordCount
FROM @T AS T
CROSS APPLY (
    SELECT CAST(D.column1 AS NVARCHAR(MAX)) COLLATE SQL_Latin1_General_CP1_CI_AS
    FROM dbo.GetTableFromList(dbo.RegExReplace(T.Name, '[^A-Za-z0-9 ]', ' '), ' ') AS D
    ) AS D(Phrase)
CROSS APPLY (
    SELECT TOP (1) S.display_term
    FROM sys.dm_fts_parser('FORMSOF(FREETEXT, "' + D.Phrase + '")', 1033, 0, 0) AS S
    WHERE S.source_term = D.Phrase
    ORDER BY S.keyword
    ) AS S
WHERE NOT EXISTS (
        SELECT 1
        FROM sys.fulltext_stopwords AS FS
        WHERE FS.stopword = D.Phrase
            AND FS.language_id = 1033
        )
GROUP BY S.display_term
ORDER BY COUNT(*) DESC;
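To see what the inner CROSS APPLY works with, it can help to run sys.dm_fts_parser on a single word first. The snippet below is only an illustration using the same arguments as the query above (LCID 1033, stoplist id 0 for the system stoplist, accent-insensitive); the word records is just a sample input. As far as I know, the function requires the Full-Text Search feature to be installed and sysadmin membership, so it may not be usable in every environment.

```sql
-- Inspect the word forms the full-text parser generates for one sample term.
SELECT P.source_term,     -- the term as it was passed in
       P.display_term,    -- a generated form of that term
       P.expansion_type   -- how the form was derived
FROM sys.dm_fts_parser('FORMSOF(FREETEXT, "records")', 1033, 0, 0) AS P
ORDER BY P.keyword;
```

The main query keeps only the first row per source term (TOP (1) ordered by keyword) and groups on that display_term.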

Answer 1: (score: 0)

Here is a solution for small tables. Performance will degrade quickly as the two tables grow.

As you can see, I used a keywords table and populated it manually. This avoids noise words such as "and", which would otherwise show up as a result of splitting.

/*
Create Table Requests (id int identity(1,1), Request nvarchar(max))
insert into Requests select 'How to delete a record?'
insert into Requests select 'delete a record from system'
insert into Requests select 'user wants to delete products'
insert into Requests select 'delete multiple user accounts'
*/
--Create Table Keywords (keyword nvarchar(100))
--insert into Keywords values ('delete'),('user'),('how'),('account'),('system'),('product')
select k.keyword, count(k.keyword) as cnt
from Requests r
cross apply Keywords k 
where r.Request like ('%' + k.keyword + '%')
group by k.keyword

A better approach than the solution above is to use SQL Server full-text search.
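As a rough sketch of what that could look like for the Requests table above: once a full-text index exists, sys.dm_fts_index_keywords reports every indexed term together with the number of rows it appears in. The index and catalog names (UX_Requests_id, RequestsCatalog) are hypothetical, and I am assuming the Full-Text Search feature is installed. The index is populated in the background, so the final query may return nothing until population has finished.

```sql
-- A full-text index needs a unique, single-column, non-nullable key index.
CREATE UNIQUE INDEX UX_Requests_id ON dbo.Requests (id);

CREATE FULLTEXT CATALOG RequestsCatalog;

-- Index the Request column as English text and filter system stop words.
CREATE FULLTEXT INDEX ON dbo.Requests (Request LANGUAGE 1033)
    KEY INDEX UX_Requests_id ON RequestsCatalog
    WITH STOPLIST = SYSTEM;
GO

-- One row per indexed term, with the number of rows containing it.
SELECT K.display_term AS keyword, K.document_count AS cnt
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.Requests')) AS K
WHERE K.display_term <> 'END OF FILE'   -- marker row, not a real keyword
ORDER BY K.document_count DESC;
```

Unlike the LIKE-based query, this does not need a hand-maintained Keywords table, and stop words are filtered by the stoplist.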

Answer 2: (score: 0)

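If you are able to use SQL Server Semantic Search for this task (it was introduced in SQL Server 2012), it will give you the most common key phrases or keywords in the table.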

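See the SQL Server semantic search tutorial for how to enable full-text search and semantic search on SQL Server, create a full-text index, and query text data for important key phrases with the semantic search function SemanticKeyPhraseTable.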
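For orientation, the setup roughly comes down to the two statements below. This is a sketch, assuming a full-text index already exists on Requests.Request with id as its key column and that the semantic language statistics database has been installed and registered on the instance; the tutorial covers those steps.

```sql
-- Returns a row only if a semantic language statistics database is registered.
SELECT * FROM sys.fulltext_semantic_language_statistics_database;

-- Turn on key phrase extraction for the already full-text-indexed column.
ALTER FULLTEXT INDEX ON dbo.Requests
    ALTER COLUMN Request ADD STATISTICAL_SEMANTICS;
```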

I have also included a SELECT query that can be used for this requirement:

;WITH tbl as (
    SELECT * FROM SemanticKeyPhraseTable(Requests, Request)
    INNER JOIN Requests ON document_key = id
)
SELECT TOP 3 keyphrase, COUNT(*)
FROM tbl
GROUP BY keyphrase
ORDER BY COUNT(*) desc 

I strongly recommend integrating full-text indexing into your database solution and configuring semantic search for text analysis tasks.