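I am analyzing data in the 'RawDataDescriptions' table. The table has a 'Description' field that holds free-text user input.
I am looking for a way to broadly categorize the descriptions by phrases or frequently occurring strings (including a count of how often each one occurs).
I don't have a specific word or phrase to search for with a LIKE clause; instead, I am looking for commonalities across the fields.
While searching other questions for this, I found a query that I adapted to my table to return the most common words (pasted below), but of course a single word gives little, if any, insight into the descriptions.
Is it possible to write a query that returns counts of phrases rather than just single words? If so, what are its main ingredients?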
WITH E1(N) AS
(
    SELECT 1
    FROM (VALUES
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT
    x.Item,
    COUNT(*)
FROM RawDataDescriptions p
CROSS APPLY (
    SELECT
        ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item = LTRIM(RTRIM(SUBSTRING(p.[Description], l.N1, l.L1)))
    FROM (
        SELECT s.N1,
            L1 = ISNULL(NULLIF(CHARINDEX(' ',p.[Description],s.N1),0) - s.N1,4000)
        FROM(
            SELECT 1 UNION ALL
            SELECT t.N+1
            FROM(
                SELECT TOP (ISNULL(DATALENGTH(p.[Description])/2,0))
                    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                FROM E4
            ) t(N)
            WHERE SUBSTRING(p.[Description] ,t.N,1) = ' '
        ) s(N1)
    ) l(N1, L1)
) x
WHERE x.item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
*Edit - not feasible. Alternative expected result:
Sample table
Id | Description
---+--------------------------
01 | Customer didn't like it
02 | Person liked it
03 | Person didn't like it
04 | Client didn't like it
05 | person liked it
@Parameter = 3
Expected result:
string | count
-----------------+-------
didn't like it | 3
Person liked it | 2
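Edit 2 ** The original question is feasible - see the answer
Answer 0 (score: 2)
Here is one option. I have a few concerns, such as punctuation, control characters, and especially performance on large tables.
Example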
Declare @RawDataDescriptions Table ([Id] varchar(50),[Description] varchar(50))
Insert Into @RawDataDescriptions Values
('01','Customer didn''t like it')
,('02','Person liked it')
,('03','Person didn''t like it')
,('04','Client didn''t like it')
,('05','person liked it')
;with cte as (
    -- Split each description into words; RetSeq is the word's position
    Select Id
          ,B.*
    From  @RawDataDescriptions A
    Cross Apply (
        Select RetSeq = Row_Number() over (Order By (Select null))
              ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
        From  (Select x = Cast('<x>' + replace((Select replace(A.[Description],' ','§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A
        Cross Apply x.nodes('x') AS B(i)
    ) B
)
Select Phrase
      ,Cnt = count(*)
From  cte A
Cross Apply (
    -- Rebuild a phrase from the current word plus the next two (3 words)
    Select Phrase = stuff((Select ' '+RetVal
                           From  cte
                           Where ID = A.ID
                             and RetSeq between A.RetSeq and A.RetSeq+2
                           Order By RetSeq
                           For XML Path('')),1,1,'')
) B
Where Phrase like '% % %'   -- keep only windows with at least 3 words
Group By Phrase
Having count(*)>1
Order By 2 Desc
Returns
Phrase          | Cnt
----------------+-----
didn't like it  | 3
Person liked it | 2
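As a side note, the same sliding-window idea can be sketched without the XML splitter on newer versions. This is only a sketch under stated assumptions, not the approach used in this answer: it assumes SQL Server 2022 or Azure SQL Database (for STRING_SPLIT with the enable_ordinal argument) and STRING_AGG, and the names @WordCnt, words and phrases are made up for illustration.

```sql
-- Assumption: SQL Server 2022+ / Azure SQL Database (STRING_SPLIT ordinal, STRING_AGG).
DECLARE @WordCnt int = 3;   -- hypothetical parameter: number of words per phrase

DECLARE @RawDataDescriptions TABLE ([Id] varchar(50), [Description] varchar(50));
INSERT INTO @RawDataDescriptions VALUES
 ('01','Customer didn''t like it')
,('02','Person liked it')
,('03','Person didn''t like it')
,('04','Client didn''t like it')
,('05','person liked it');

WITH words AS (
    -- One row per word; Seq renumbers the words after empty tokens are dropped
    SELECT d.Id,
           Seq  = ROW_NUMBER() OVER (PARTITION BY d.Id ORDER BY s.ordinal),
           Word = s.value
    FROM @RawDataDescriptions d
    CROSS APPLY STRING_SPLIT(d.[Description], ' ', 1) s
    WHERE s.value <> ''
),
phrases AS (
    -- For each starting word, glue together the next @WordCnt words in order
    SELECT a.Id, a.Seq,
           Phrase = STRING_AGG(b.Word, ' ') WITHIN GROUP (ORDER BY b.Seq)
    FROM words a
    JOIN words b
      ON b.Id = a.Id
     AND b.Seq BETWEEN a.Seq AND a.Seq + @WordCnt - 1
    GROUP BY a.Id, a.Seq
    HAVING COUNT(*) = @WordCnt      -- drop short windows at the end of a description
)
SELECT Phrase, Cnt = COUNT(*)
FROM phrases
GROUP BY Phrase
HAVING COUNT(*) > 1
ORDER BY Cnt DESC;
```

Whether 'Person liked it' and 'person liked it' are counted together depends on the collation in effect, as with the XML version. I have not compared the performance of the two approaches.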
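Update - TVF - better performance
I decided to turn this into a table-valued function and was shocked by the performance gain. For example, I have 130,000 descriptions from FRED (Federal Reserve Economic Data), and I was able to generate a list of common phrases (n words) in 9 seconds.
Usage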
Select Phrase = B.RetVal
,Cnt = count(*)
From YourTable A
Cross Apply [dbo].[tvf-Str-Parse-Phrase](A.YourColumn,' ',4) B
Group By B.RetVal
Having count(*)>1
Order By 2 Desc
The TVF, if interested
CREATE FUNCTION [dbo].[tvf-Str-Parse-Phrase] (@String varchar(max),@Delimeter varchar(25),@WordCnt int)
Returns Table
As
Return (
    with cte as (
        -- Split @String on @Delimeter; RetSeq is the word's position
        Select RetSeq = Row_Number() over (Order By (Select null))
              ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
        From  (Select x = Cast('<x>' + replace((Select replace(@String,@Delimeter,'§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A
        Cross Apply x.nodes('x') AS B(i)
    )
    Select RetSeq = Row_Number() over (Order By (Select Null))
          ,B.RetVal
    From  cte A
    -- Build a phrase of @WordCnt consecutive words; Order By keeps the words in sequence
    Cross Apply (Select RetVal = stuff((Select ' '+RetVal From cte Where RetSeq between A.RetSeq and A.RetSeq+@WordCnt-1 Order By RetSeq For XML Path('')),1,1,'') ) B
    Where B.RetVal like Replicate('% ',@WordCnt-1)+'%'
);
--Select * from [dbo].[tvf-Str-Parse-Phrase]('This is some text that I want parsed',' ',4)
Answer 1 (score: 0)
You can enable a Microsoft Full-Text Index on the table and run this kind of frequent-word and character analysis against it.
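For illustration only, a minimal sketch of what that could look like. The catalog name ftDescriptions and the key index name PK_RawDataDescriptions are assumptions; a full-text index needs a unique, single-column, non-nullable index on the table, so substitute whatever yours is called. Note that sys.dm_fts_index_keywords reports individual indexed terms rather than multi-word phrases, and words on the stoplist are not included.

```sql
-- Assumed names: ftDescriptions (catalog), PK_RawDataDescriptions (unique key index).
CREATE FULLTEXT CATALOG ftDescriptions;

CREATE FULLTEXT INDEX ON dbo.RawDataDescriptions ([Description])
    KEY INDEX PK_RawDataDescriptions
    ON ftDescriptions;

-- Run this once the index has finished populating.
-- document_count = number of rows whose Description contains the term.
SELECT TOP (50)
       display_term,
       document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.RawDataDescriptions'))
WHERE display_term <> 'END OF FILE'   -- marker row that ends the keyword list
ORDER BY document_count DESC;
```

This gives term frequencies with little code, but for phrase counts it would still have to be combined with something like the sliding-window approach in the other answer.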