SQL Server: find the frequency of the most common words in a column (counted per row, not per word)

Time: 2019-05-21 02:29:28

Tags: sql sql-server

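This has been asked several times, but I can't find the specific answer I need. I have a query that finds the most common words in a column in SQL Server and lists how often each one occurs. The problem is that if a word appears more than once in a single row, every occurrence is counted. I want each word to be counted only once per row.

So a row with the value "to be or not to be" would count "to" and "be" once each, rather than adding two each to the total frequency.

Here is the current query. It also strips out common words such as pronouns and replaces all the common separators with spaces. It is a bit old, so I suspect it could be a lot tidier.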

    SELECT   sep.Col Phrase, count(*) as Qty
    FROM (
        Select * FROM (
            Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) 
            FROM Table
        ) easyValues
        Where value <> ''
        ) actualValues 
        Cross Apply dbo.SeparateValues(value, ' ') sep
    WHERE sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                        , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                        , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
          and LEN(sep.Col) > 2
    GROUP By sep.Col
    HAVING count(*) > 1

Besides fixing the repeated-word problem, any thoughts on a better way to do this would be appreciated.

3 answers:

Answer 0: (score: 2)

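You simply need to GROUP BY twice.

First, group by sep.Col and Table.ID to remove the duplicates within a row. Your table has some kind of ID column, right?

Second, group by sep.Col alone to get the final counts.

I also rewrote your query using CTEs to make it readable. At least to me, it is more readable this way.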

WITH
easyValues
AS
(
    Select
        ID
        ,value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) 
    FROM Table
)
,actualValues
AS
(
    SELECT
        ID
        ,Value
    FROM easyValues
    Where value <> ''
)
,SeparateValues
AS
(
    SELECT
        ID
        ,sep.Col
    FROM
        actualValues
        Cross Apply dbo.SeparateValues(value, ' ') AS sep
    WHERE
        sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                        , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                        , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
        and LEN(sep.Col) > 2
)
,UniqueValues
AS
(
    SELECT
        ID, Col
    FROM
        SeparateValues
    GROUP BY
        ID, Col
)
SELECT
    Col AS Phrase
    ,count(*) as Qty
FROM UniqueValues
GROUP By Col
HAVING count(*) > 1
;
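
The two grouping steps can also be collapsed into one GROUP BY with COUNT(DISTINCT ...), because counting the distinct row IDs per word is the same as counting each word at most once per row. Below is a minimal sketch, not a drop-in replacement. It assumes SQL Server 2016 or later (for the built-in STRING_SPLIT, used here in place of dbo.SeparateValues) and uses a made-up table variable @Titles with ID and Title columns. The punctuation cleanup and stop-word list are left out.

```sql
-- Hypothetical sample data; @Titles stands in for the real table.
DECLARE @Titles TABLE (ID INT PRIMARY KEY, Title NVARCHAR(200));

INSERT INTO @Titles (ID, Title)
VALUES (1, N'TO BE OR NOT TO BE'),
       (2, N'NOT TO WORRY');

-- COUNT(DISTINCT t.ID) counts the rows a word appears in,
-- so repeats inside a single row are counted only once.
SELECT
    s.value AS Phrase
    ,COUNT(DISTINCT t.ID) AS Qty
FROM @Titles AS t
    CROSS APPLY STRING_SPLIT(t.Title, N' ') AS s
WHERE s.value <> N''
GROUP BY s.value
HAVING COUNT(DISTINCT t.ID) > 1;
```

With this sample data the query returns NOT and TO, each with a Qty of 2, even though TO appears twice in the first row.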

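Answer 1: (score: 1)

To meet your requirement, you can use a FUNCTION that splits a string into a list of words on the space delimiter ' '. With that function in place, you can then use some procedural T-SQL (a CURSOR, for example) to get the final counts.

First, create the FUNCTION as follows (code source: Stack Overflow):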

CREATE  FUNCTION dbo.splitstring ( @stringToSplit VARCHAR(MAX) )
RETURNS @returnList TABLE ([Word] [nvarchar] (500))
AS
BEGIN
    DECLARE @name NVARCHAR(255)
    DECLARE @pos INT

    -- Peel off one word at a time while a space remains in the string
    WHILE CHARINDEX(' ', @stringToSplit) > 0
    BEGIN
        SELECT @pos  = CHARINDEX(' ', @stringToSplit)
        SELECT @name = SUBSTRING(@stringToSplit, 1, @pos-1)

        INSERT INTO @returnList
        SELECT @name

        SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
    END

    -- Whatever is left after the last space is the final word
    INSERT INTO @returnList
    SELECT @stringToSplit

    RETURN
END
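
As a quick check of what the function returns, it can be called directly on a literal. This is only an illustration and assumes dbo.splitstring has been created exactly as above. The second query shows the DISTINCT step that removes repeats within a single string.

```sql
-- One row per word, repeats included: six rows for this input.
SELECT Word
FROM dbo.splitstring('TO BE OR NOT TO BE');

-- DISTINCT collapses the repeats: four rows (BE, NOT, OR, TO),
-- returned in no guaranteed order.
SELECT DISTINCT Word
FROM dbo.splitstring('TO BE OR NOT TO BE');
```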

Then use this CURSOR script to get the final output:

DECLARE @Value VARCHAR(MAX)
DECLARE @WordList TABLE
(
  Word VARCHAR(200)
)

DECLARE db_cursor CURSOR 
FOR 
SELECT Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace
                        (Replace(Replace(Replace(Replace(Replace(Replace(Replace
                        (Replace(Replace(title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' ')
                        , '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) [Value]
FROM table

OPEN db_cursor  
FETCH NEXT FROM db_cursor INTO @Value  

WHILE @@FETCH_STATUS = 0  
BEGIN  

    INSERT INTO @WordList
    SELECT DISTINCT Word FROM [dbo].[splitstring](@Value)
    WHERE Word NOT IN ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
                    , 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
                    , '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
    AND LEN(Word) > 2

    FETCH NEXT FROM db_cursor INTO @Value 
END 

CLOSE db_cursor  
DEALLOCATE db_cursor


SELECT Word,COUNT(*)
FROM @WordList
GROUP BY Word 
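
For comparison, the same per-row de-duplication can be written without a cursor by applying the function to each row with CROSS APPLY. This is only a sketch. It assumes dbo.splitstring exists as defined above, uses a hypothetical table variable @Titles in place of the real table, and leaves out the punctuation cleanup and the stop-word list for brevity.

```sql
-- Hypothetical sample data; @Titles stands in for the real table.
DECLARE @Titles TABLE (Title VARCHAR(200));

INSERT INTO @Titles (Title)
VALUES ('TO BE OR NOT TO BE'),
       ('NOT TO WORRY');

-- The DISTINCT inside CROSS APPLY runs once per row, so each word
-- contributes at most 1 to the outer COUNT(*) for that row.
SELECT
    w.Word
    ,COUNT(*) AS Qty
FROM @Titles AS t
    CROSS APPLY (
        SELECT DISTINCT Word
        FROM dbo.splitstring(t.Title)
    ) AS w
WHERE LEN(w.Word) > 2
GROUP BY w.Word;
```

With this sample data the result is NOT with a Qty of 2 and WORRY with a Qty of 1. The two-letter words are dropped by the LEN filter.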

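Answer 2: (score: 1)

As far as I can tell, the STRING_SPLIT function together with CROSS APPLY gives you what you need. You split each string on the space delimiter, select the distinct words for each row, and then count them in the outer query. For brevity, I left out the part that excludes specific words.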

CREATE TABLE phrases(phrase NVARCHAR(MAX));

INSERT INTO phrases(phrase)VALUES(N'To be or not to be'),(N'this is not a phrase'),(N'And why is this not another one');

SELECT 
    w.value,
    COUNT(*) 
FROM 
    phrases AS p 
    CROSS APPLY (
        SELECT DISTINCT 
            value 
         FROM 
            STRING_SPLIT(p.phrase,N' ')
    ) AS w
GROUP BY 
    w.value;
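
The omitted word filter can be put back inside the CROSS APPLY, roughly as sketched below. Some assumptions apply: STRING_SPLIT requires SQL Server 2016 or later with database compatibility level 130 or higher, the stop-word list here is abbreviated rather than the full list from the question, and the table variable @phrases is a stand-in so that the snippet runs on its own.

```sql
-- Stand-in for the phrases table, so the snippet is self-contained.
DECLARE @phrases TABLE (phrase NVARCHAR(200));

INSERT INTO @phrases (phrase)
VALUES (N'To be or not to be'),
       (N'this is not a phrase'),
       (N'And why is this not another one');

SELECT
    w.value AS Phrase
    ,COUNT(*) AS Qty
FROM
    @phrases AS p
    CROSS APPLY (
        -- DISTINCT per row: a word repeated in one phrase counts once.
        SELECT DISTINCT
            UPPER(s.value) AS value
        FROM
            STRING_SPLIT(p.phrase, N' ') AS s
        WHERE
            UPPER(s.value) NOT IN (N'', N'THE', N'AND', N'THIS')  -- abbreviated stop-word list
            AND LEN(s.value) > 2
    ) AS w
GROUP BY
    w.value;
```

With this sample data NOT is reported with a Qty of 3, because it appears in all three phrases and passes the filter.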