Most frequent combinations of words/characters in SQL Server

Time: 2017-10-20 15:22:40

Tags: sql sql-server tsql

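I am analyzing data in the 'RawDataDescriptions' table, which has a 'Description' field open to free-form user input.

I'm looking for a way to broadly categorize the descriptions by phrases or strings that occur frequently (including a count of their occurrences).

I don't have specific words or phrases that I could search for with a 'LIKE' statement; instead, I'm looking for commonalities across the fields.

While searching other questions for this, I managed to find a query, which I adapted to my table, that returns the most common single words (pasted below). Of course, a single word provides little, if any, insight into the descriptions.

Is it possible to write a query that returns counts of phrases rather than just single words? If so, what would its main ingredients be?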

WITH E1(N) AS 
(
    SELECT 1 
    FROM (VALUES
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT
    x.Item,
    COUNT(*)
FROM RawDataDescriptions p
CROSS APPLY (
SELECT 
        ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = LTRIM(RTRIM(SUBSTRING(p.[Description], l.N1, l.L1)))
        FROM (
            SELECT s.N1,
                L1 = ISNULL(NULLIF(CHARINDEX(' ',p.[Description],s.N1),0)-s.N1,4000)
            FROM(
                SELECT 1 UNION ALL
                SELECT t.N+1 
                FROM(
                    SELECT TOP (ISNULL(DATALENGTH(p.[Description])/2,0))
                        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                    FROM E4
                ) t(N)
                WHERE SUBSTRING(p.[Description] ,t.N,1) = ' '
            ) s(N1)
        ) l(N1, L1)
) x
WHERE x.item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
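
For readers unfamiliar with the pattern, the E1/E2/E4 CTEs above build a numbers (tally) source by cross-joining a 10-row set with itself. A minimal standalone sketch, not part of the original query, that counts the rows each level yields:

```sql
-- Standalone sketch: count the rows each stacked CTE produces.
WITH E1(N) AS (
    SELECT 1
    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)  -- 10 rows
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),  -- 10 * 10 = 100 rows
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)   -- 100 * 100 = 10,000 rows
SELECT
    E1Rows = (SELECT COUNT(*) FROM E1),
    E2Rows = (SELECT COUNT(*) FROM E2),
    E4Rows = (SELECT COUNT(*) FROM E4);
```

Because E4 tops out at 10,000 rows, the query as written can inspect at most the first 10,000 character positions of a description. Also note that DATALENGTH(...)/2 assumes a two-byte-per-character (nvarchar) column; for a varchar column it would likely cover only half of the string.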

*Edit - not feasible. Alternative expected result:

Sample table

Id | Description  
---+--------------------------
01 | Customer didn't like it  
02 | Person liked it  
03 | Person didn't like it  
04 | Client didn't like it  
05 | person liked it   

@Parameter = 3

Desired result:

string           | count  
-----------------+-------
didn't like it   | 3  
Person liked it  | 2  

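Edit 2 ** The original question is feasible - see the answers

2 answers:

Answer 0: (score: 2)

Here is one option. I have a few concerns, such as punctuation, control characters, and especially performance on large tables.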
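
One possible way to soften the punctuation and control-character concern is to normalize each description before splitting it. The following is only a sketch under assumptions: the @Sample table and its rows are made up for illustration, and the list of characters to strip is a guess that would need adjusting for real data.

```sql
-- Hypothetical sample rows; the table variable and its contents are illustrative only.
DECLARE @Sample TABLE ([Id] int, [Description] varchar(100));
INSERT INTO @Sample VALUES
 (1, 'Person liked it.')
,(2, 'person liked it!')
,(3, 'Person, liked' + CHAR(9) + 'it');

SELECT [Id]
      ,Cleaned = LOWER(LTRIM(RTRIM(
                   REPLACE(REPLACE(REPLACE(REPLACE(      -- drop common punctuation
                   REPLACE(REPLACE(REPLACE(              -- turn tab / LF / CR into spaces
                       [Description]
                   ,CHAR(9),' '),CHAR(10),' '),CHAR(13),' ')
                   ,'.',''),',',''),'!',''),'?','')
                 )))
  FROM @Sample;
```

With these three rows, each Cleaned value should come out as 'person liked it'. Apostrophes are deliberately kept so that words like "didn't" survive, and runs of consecutive spaces are not collapsed here.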

Example

Declare @RawDataDescriptions Table ([Id] varchar(50),[Description] varchar(50))
Insert Into @RawDataDescriptions Values 
 ('01','Customer didn''t like it')
,('02','Person liked it')
,('03','Person didn''t like it')
,('04','Client didn''t like it')
,('05','person liked it')

;with cte as (
    Select Id
          ,B.* 
      From  @RawDataDescriptions A
      Cross Apply (
                    Select RetSeq = Row_Number() over (Order By (Select null))
                          ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
                    From  (Select x = Cast('<x>' + replace((Select replace(A.[Description],' ','§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A 
                    Cross Apply x.nodes('x') AS B(i)
                  ) B 
)
Select Phrase
      ,Cnt  = count(*)
 From  cte A
 Cross Apply (
     Select Phrase = stuff((Select ' '+RetVal
                            From  cte 
                            Where ID = A.ID
                              and RetSeq between A.RetSeq and A.RetSeq+2
                            Order By RetSeq
                            For XML Path('')),1,1,'')

             ) B
  Where Phrase like '% % %'
  Group By Phrase
  Having count(*)>1
  Order By 2 Desc

Returns

Phrase           Cnt
didn't like it   3
Person liked it  2
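
As a point of comparison, the same three-word windows can probably be built with LEAD (SQL Server 2012 and later) instead of a correlated FOR XML subquery. This is a swapped-in technique, not the answer's method, and it assumes the words have already been split into a hypothetical @Words table with one row per word and its position:

```sql
-- Hypothetical pre-split input: one row per word, with its position in the description.
DECLARE @Words TABLE ([Id] varchar(50), [Seq] int, [Word] varchar(50));
INSERT INTO @Words VALUES
 ('01',1,'Customer'),('01',2,'didn''t'),('01',3,'like'),('01',4,'it')
,('02',1,'Person'),('02',2,'liked'),('02',3,'it')
,('03',1,'Person'),('03',2,'didn''t'),('03',3,'like'),('03',4,'it')
,('04',1,'Client'),('04',2,'didn''t'),('04',3,'like'),('04',4,'it')
,('05',1,'person'),('05',2,'liked'),('05',3,'it');

;WITH trigrams AS (
    SELECT [Id]
          ,W1 = [Word]
          ,W2 = LEAD([Word],1) OVER (PARTITION BY [Id] ORDER BY [Seq])
          ,W3 = LEAD([Word],2) OVER (PARTITION BY [Id] ORDER BY [Seq])
      FROM @Words
)
SELECT Phrase = W1 + ' ' + W2 + ' ' + W3
      ,Cnt    = COUNT(*)
  FROM trigrams
 WHERE W3 IS NOT NULL              -- skip windows that run past the last word
 GROUP BY W1 + ' ' + W2 + ' ' + W3
HAVING COUNT(*) > 1
 ORDER BY Cnt DESC;
```

Under a case-insensitive collation this should group 'Person liked it' with 'person liked it', as the answer's query does. The window size is fixed at three words here; a variable word count would need a different construction.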
  

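Update - TVF - better performance

I decided to turn this into a table-valued function and was shocked by the performance gain. For example, I have 130,000 descriptions from FRED (Federal Reserve Economic Data), and I was able to generate a list of common phrases (n words) in 9 seconds.

Usage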

Select Phrase = B.RetVal
      ,Cnt    = count(*)
 From YourTable A
 Cross Apply [dbo].[tvf-Str-Parse-Phrase](A.YourColumn,' ',4) B
 Group By B.RetVal
 Having count(*)>1
 Order By 2 Desc

The TVF, if interested

CREATE FUNCTION [dbo].[tvf-Str-Parse-Phrase] (@String varchar(max),@Delimeter varchar(25),@WordCnt int)
Returns Table 
As
Return (  
 with cte as (
      Select RetSeq = Row_Number() over (Order By (Select null))
            ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
      From  (Select x = Cast('<x>' + replace((Select replace(@String,@Delimeter,'§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A 
      Cross Apply x.nodes('x') AS B(i)
)
Select RetSeq = Row_Number() over (Order By (Select Null))
      ,B.RetVal
 From  cte A
 Cross Apply (Select RetVal = stuff((Select ' '+RetVal From cte Where RetSeq between A.RetSeq and A.RetSeq+@WordCnt-1 For XML Path('')),1,1,'') ) B
 Where B.RetVal like Replicate('% ',@WordCnt-1)+'%'
);
--Select * from [dbo].[tvf-Str-Parse-Phrase]('This is some text that I want parsed',' ',4)
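
The final Where clause in the function is what discards the short windows at the end of a string. A small sketch of the pattern it builds, using an illustrative word count of 3:

```sql
-- Sketch: the LIKE pattern the function builds for a given word count.
DECLARE @WordCnt int = 3;

SELECT Pattern   = REPLICATE('% ', @WordCnt - 1) + '%'   -- '% % %' when @WordCnt = 3
      ,FullMatch = CASE WHEN 'didn''t like it' LIKE REPLICATE('% ', @WordCnt - 1) + '%'
                        THEN 1 ELSE 0 END                 -- three words: kept
      ,TailMatch = CASE WHEN 'like it'         LIKE REPLICATE('% ', @WordCnt - 1) + '%'
                        THEN 1 ELSE 0 END;                -- two words: filtered out
```

The pattern only requires at least @WordCnt-1 spaces, which is enough because a window never holds more than @WordCnt words.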

Answer 1: (score: 0)

You can enable a Microsoft Full-Text Index on the table and use it to run this kind of frequent word and character analysis.
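
A rough sketch of what that could look like, under several assumptions that are not in the original post: the table lives in the dbo schema, it has a unique, single-column, non-nullable index named PK_RawDataDescriptions, the Full-Text Search feature is installed, and ftDescriptions is a made-up catalog name.

```sql
-- Assumed names: dbo schema, PK_RawDataDescriptions key index, ftDescriptions catalog.
CREATE FULLTEXT CATALOG ftDescriptions;
GO

CREATE FULLTEXT INDEX ON dbo.RawDataDescriptions ([Description])
    KEY INDEX PK_RawDataDescriptions
    ON ftDescriptions;
GO

-- Once the index has been populated, list indexed terms by how many rows contain them.
SELECT display_term, document_count
  FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.RawDataDescriptions'))
 WHERE display_term <> 'END OF FILE'
 ORDER BY document_count DESC;
```

Note that sys.dm_fts_index_keywords reports individual indexed terms, so on its own this gives word frequencies rather than multi-word phrase counts.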