Convert a table with key and comment fields into keys plus one row per word in the comment column

Time: 2019-02-11 22:46:11

Tags: tsql

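I have a table containing unstructured data that I am trying to analyze in order to build a relationship lookup. I do not have word cloud software available.

I really don't know how to approach this. Searching for a solution has led me to tools that may cost money rather than to a coded solution.

Basically, my data looks like this: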

CK1          CK2          Comment
--------------------------------------------------------------
 1            A           This is a comment.
 2            A           Another comment here.

This is what I need to create:

CK1          CK2          Words
--------------------------------------------------------------
 1            A           This
 1            A           is
 1            A           a
 1            A           comment.
 2            A           Another
 2            A           comment
 2            A           here.

2 answers:

Answer 0: (score: 0)

1) If you are on SQL Server 2016 or later, you can probably use the built-in function STRING_SPLIT

-- SQL Server 2016 and above (database compatibility level 130 or higher)
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
select [value] from STRING_SPLIT(@txt, ' ') 
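To apply the same function to every row of the table in the question, STRING_SPLIT can be combined with CROSS APPLY. The following is a minimal sketch: the table variable @comments is only a stand-in for the real table, and the order in which STRING_SPLIT returns the words is not guaranteed.

```sql
-- Stand-in for the table from the question (hypothetical name @comments)
DECLARE @comments TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @comments (CK1, CK2, Comment)
VALUES (1, 'A', 'This is a comment.'),
       (2, 'A', 'Another comment here.');

-- One output row per space-separated word, with the keys repeated on each row
SELECT c.CK1, c.CK2, s.[value] AS Words
FROM @comments AS c
CROSS APPLY STRING_SPLIT(c.Comment, ' ') AS s;
```

Punctuation stays attached to the words (for example "comment."), which matches the output requested in the question.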

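2) Only if 1 is not an option, and the text contains at most 3 delimiters (spaces in our case), that is, at most 4 parts (which fits your sample data), you can use PARSENAME

-- BEFORE SQL 2016, if we have at most 4 parts (PARSENAME numbers the parts from right to left)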
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
DECLARE @Temp NVARCHAR(200) = REPLACE (@txt,'.','@')
SELECT t FROM (VALUES(1),(2),(3),(4))T1(n)
CROSS APPLY (SELECT REPLACE(PARSENAME(REPLACE(@Temp,' ','.'),T1.n), '@','.'))T2(t)
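The four-part limit comes from PARSENAME itself, which is designed for object names of the form server.database.schema.object. A small sketch of its behavior, using made-up input strings:

```sql
SELECT
  PARSENAME('This.is.a.comment', 1)       AS part_1,         -- rightmost part: 'comment'
  PARSENAME('This.is.a.comment', 4)       AS part_4,         -- leftmost part: 'This'
  PARSENAME('One.two.three.four.five', 1) AS too_many_parts; -- NULL: more than four parts
```

This is also why the query above first swaps the real periods for '@': PARSENAME would otherwise treat them as part separators.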

3) Only if neither 1 nor 2 is an option should you use a SQLCLR function

http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/

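4) Only if you cannot use 1 or 2 and also cannot use SQLCLR (which points to a real administration problem and has nothing to do with security, since you can keep all SQLCLR functions in a read-only database for all users to use, as I have explained in my lectures) should you use T-SQL and create a UDF.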

https://sqlperformance.com/2012/07/t-sql-queries/split-strings

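Answer 1: (score: 0)

What you are trying to do is tokenize a string using a space as the delimiter. In the SQL world, a function that does this is usually called a "splitter". The potential pitfall of using a delimiter for this kind of task is that words can be separated by multiple spaces, tabs, CHAR(10), CHAR(13), CHAR(), and so on. Poor grammar is another problem; for example, not adding a space after a period results in: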

" End of sentence.Next sentence" 

with "sentence.Next" being returned as a single word.
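The pitfall is easy to reproduce with the built-in STRING_SPLIT (SQL Server 2016 or later, compatibility level 130 or higher), used here purely as an illustration:

```sql
-- Splitting on a single space keeps 'sentence.Next' together
-- and turns the leading space into an empty token.
SELECT s.[value]
FROM STRING_SPLIT(' End of sentence.Next sentence', ' ') AS s;
-- Returns five rows (in no guaranteed order):
-- '', 'End', 'of', 'sentence.Next', 'sentence'
```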

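The way I like to tokenize human text is to:

  1. Replace every character that is not a letter with a space
  2. Replace duplicate spaces with a single space
  3. Trim the string
  4. Split the newly transformed string using a space as the delimiter.

Below is my solution, followed by the DDL to create the functions used.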

-- Sample Data
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES
(1,'A','This is a typical comment...Followed by another...'),
(2,'A','This comment has  double  spaces  and       tabs and even carriage
returns!');

-- Solution
SELECT      t.CK1, t.CK2, split.itemNumber, split.itemIndex, split.itemLength, split.item
FROM        @yourtable                                              AS t
CROSS APPLY samd.patReplace8K(t.Comment,'[^a-zA-Z ]',' ')           AS c1 -- name as defined in the DDL below
CROSS APPLY dbo.RemoveDupChar8K(c1.newString,' ')                   AS c2 -- created in the dbo schema below
CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;

Results (truncated for brevity):

CK1         CK2  itemNumber           itemIndex   itemLength  item
----------- ---- -------------------- ----------- ----------- --------------
1           A    1                    1           4           This
1           A    2                    6           2           is
1           A    3                    9           1           a
1           A    4                    11          7           typical
1           A    5                    19          7           comment
...
2           A    1                    1           4           This
2           A    2                    6           7           comment
2           A    3                    14          3           has
2           A    4                    18          6           double
... 

Note that the splitter I'm using is based on Jeff Moden's DelimitedSplit8K, with a couple of tweaks.
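For comparison, the four steps listed earlier can be approximated with built-in functions only. This is a rough sketch, not the solution above: it assumes SQL Server 2016 or later (compatibility level 130 or higher) for STRING_SPLIT, it replaces only a short, illustrative list of characters instead of using a pattern, and it filters out empty tokens rather than collapsing duplicate spaces and trimming. The sample rows are made up.

```sql
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES (1, 'A', 'This is a typical comment...Followed by another...'),
       (2, 'A', 'This comment has  double  spaces' + CHAR(9) + 'and a tab!');

SELECT t.CK1, t.CK2, s.[value] AS Words
FROM @yourtable AS t
-- Step 1 (simplified): turn a few known separator characters into spaces
CROSS APPLY (VALUES (
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(t.Comment,
        '.', ' '), '!', ' '), CHAR(9), ' '), CHAR(13), ' '), CHAR(10), ' ')
)) AS c(cleaned)
-- Step 4: split on a single space
CROSS APPLY STRING_SPLIT(c.cleaned, ' ') AS s
-- Steps 2 and 3 (variation): repeated, leading, and trailing spaces
-- produce empty tokens, which are simply discarded here
WHERE s.[value] <> '';
```

Unlike the splitter used above, STRING_SPLIT on SQL Server 2016 does not return the position of each item, so this variant cannot reproduce the itemNumber, itemIndex, or itemLength columns.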

Functions used:

CREATE FUNCTION dbo.rangeAB
(
  @low  bigint, 
  @high bigint, 
  @gap  bigint,
  @row1 bit
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS 
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values 
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
GO

CREATE FUNCTION samd.NGrams8k
(
  @string VARCHAR(8000), -- Input string
  @N      INT            -- requested token size
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  position   = r.RN,
  token      = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM  dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO

CREATE FUNCTION samd.patReplace8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50),
  @replace VARCHAR(20)
) 
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString = 
  (
    SELECT   CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
                  WHEN @pattern+@replace+@string IS NOT NULL THEN 
                    CASE WHEN PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0
                         THEN ng.token ELSE @replace END END
    FROM     samd.NGrams8K(@string, 1) AS ng
    ORDER BY ng.position
    FOR XML PATH(''),TYPE
  ).value('text()[1]', 'VARCHAR(8000)');
GO

CREATE FUNCTION samd.delimitedSplitAB8K
(
  @string    VARCHAR(8000), -- input string
  @delimiter CHAR(1)        -- delimiter
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  itemNumber   = ROW_NUMBER() OVER (ORDER BY d.p),
  itemIndex    = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),
  itemLength   = CHECKSUM(item.ln),
  item         = SUBSTRING(@string, d.p+1, item.ln)
FROM (VALUES (DATALENGTH(@string))) AS l(s) -- length of the string
CROSS APPLY
(
  SELECT 0 UNION ALL -- for handling leading delimiters
  SELECT ng.position
  FROM   samd.NGrams8K(@string, 1) AS ng
  WHERE  token = @delimiter
) AS d(p) -- delimiter.position
CROSS APPLY (VALUES(  --LEAD(d.p, 1, l.s+l.d) OVER (ORDER BY d.p) - (d.p+l.d)
  ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);
GO

CREATE FUNCTION dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN

SELECT NewString = 
 replace(replace(replace(replace(replace(replace(replace(
 @string COLLATE LATIN1_GENERAL_BIN,
 replicate(@char,33), @char), --33
 replicate(@char,17), @char), --17
 replicate(@char,9 ), @char), -- 9
 replicate(@char,5 ), @char), -- 5
 replicate(@char,3 ), @char), -- 3 
 replicate(@char,2 ), @char), -- 2
 replicate(@char,2 ), @char); -- 2
GO