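I have a table of unstructured data that I am trying to analyze in order to build a relational lookup. I do not have word-cloud software available.
I really don't know how to approach this. Searching for a solution has led me to tools that may cost money rather than to coded solutions.
Basically, my data looks like this: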
CK1  CK2  Comment
--------------------------------------------------------------
1    A    This is a comment.
2    A    Another comment here.
This is what I need to create:
CK1  CK2  Words
--------------------------------------------------------------
1    A    This
1    A    is
1    A    a
1    A    comment.
2    A    Another
2    A    comment
2    A    here.
Answer 0 (score: 0)
1) If you are on SQL Server 2016 or later, you may be able to use the built-in function STRING_SPLIT
-- SQL Server 2016 and above
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
select [value] from STRING_SPLIT(@txt, ' ')
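2) Only if option 1 is not available, and the string contains at most three delimiters (spaces, in our case), that is, no more than four parts, which fits your sample data, should you use PARSENAME. Note that PARSENAME numbers the parts from right to left, so part 1 is the last word of the string.
-- Before SQL Server 2016, if the string has at most 4 parts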
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
DECLARE @Temp NVARCHAR(200) = REPLACE (@txt,'.','@')
SELECT t FROM (VALUES(1),(2),(3),(4))T1(n)
CROSS APPLY (SELECT REPLACE(PARSENAME(REPLACE(@Temp,' ','.'),T1.n), '@','.'))T2(t)
3) Only if options 1 and 2 are not suitable should you use a SQLCLR function
http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/
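4) Only if you cannot use option 1 or 2 and cannot use SQLCLR (which points to a real administration problem rather than a security concern, since you can keep all SQLCLR functions in a read-only database for every user to call, as I explained in my lecture) should you fall back to T-SQL and create a UDF.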
https://sqlperformance.com/2012/07/t-sql-queries/split-strings
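As a sketch of how option 1 might be applied to the table in the question, STRING_SPLIT can be combined with CROSS APPLY so that each word is returned alongside its key columns. The table variable @Comments below is a hypothetical stand-in for the real table, and the filter on empty strings is an addition for illustration. STRING_SPLIT also requires database compatibility level 130 or higher and does not guarantee the order of the rows it returns.

```sql
-- Sketch only: @Comments is a hypothetical stand-in for the asker's table.
DECLARE @Comments TABLE (CK1 INT, CK2 CHAR(1), Comment NVARCHAR(4000));

INSERT @Comments (CK1, CK2, Comment)
VALUES (1, 'A', N'This is a comment.'),
       (2, 'A', N'Another comment here.');

-- One output row per word, carrying the key columns along.
SELECT c.CK1, c.CK2, s.[value] AS Words
FROM @Comments AS c
CROSS APPLY STRING_SPLIT(c.Comment, N' ') AS s
WHERE s.[value] <> N'';   -- drop empty tokens produced by consecutive spaces
```

Because the row order is not guaranteed, this shape suits word counting or lookups better than anything that depends on word position.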
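Answer 1 (score: 0)
What you are trying to do is tokenize a string using a space as the delimiter. In the SQL world, functions that do this are usually called "splitters". The potential pitfall of using a splitter for this kind of thing is that words can be separated by multiple spaces, tabs, CHAR(10), CHAR(13), CHAR(), and so on. Poor grammar, such as omitting the space after a period, results in: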
" End of sentence.Next sentence"
where "sentence.Next" is returned as a single word.
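The pitfall is easy to reproduce. The snippet below is only an illustration: it uses the built-in STRING_SPLIT (SQL Server 2016+) rather than the functions defined later in this answer, and the input string is made up for the demonstration.

```sql
-- Sketch only: assumes SQL Server 2016+ for STRING_SPLIT.
-- The two spaces after "of" and the missing space after the period are deliberate.
DECLARE @txt VARCHAR(100) = 'End of  sentence.Next sentence';

SELECT s.[value] AS token, LEN(s.[value]) AS tokenLength
FROM STRING_SPLIT(@txt, ' ') AS s;
-- Among the rows returned are a zero-length token (from the doubled space)
-- and the single token 'sentence.Next'.
```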
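The way I like to tokenize human text is to replace every character that is not a letter or a space with a space, collapse repeated spaces into one, trim leading and trailing spaces, and then split on the space.
My solution is below, followed by the DDL to create the functions it uses.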
-- Sample Data
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES
(1,'A','This is a typical comment...Followed by another...'),
(2,'A','This comment has double spaces and tabs and even carriage
returns!');
-- Solution
SELECT t.CK1, t.CK2, split.itemNumber, split.itemIndex, split.itemLength, split.item
FROM @yourtable AS t
CROSS APPLY samd.patReplace8K(t.Comment,'[^a-zA-Z ]',' ') AS c1
CROSS APPLY dbo.RemoveDupChar8K(c1.newString,' ') AS c2
CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;
Results (truncated for brevity):
CK1  CK2  itemNumber  itemIndex  itemLength  item
---  ---  ----------  ---------  ----------  -------
1    A    1           1          4           This
1    A    2           6          2           is
1    A    3           9          1           a
1    A    4           11         7           typical
1    A    5           19         7           comment
...
2    A    1           1          4           This
2    A    2           6          7           comment
2    A    3           14         3           has
2    A    4           18         6           double
...
Note that the splitter I am using is based on Jeff Moden's Delimited Split8K, with a couple of tweaks.
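For comparison, where SQL Server 2016 or later is available, the same clean-then-split idea can be roughly approximated with built-ins only. This is a simplified sketch, not the functions used in this answer: it handles only the few characters listed in the nested REPLACE calls (an arbitrary choice for illustration), relies on STRING_SPLIT, and returns no position or length information.

```sql
-- Sketch only: assumes SQL Server 2016+; the character list is illustrative.
DECLARE @txt VARCHAR(8000) = 'End of sentence.Next  sentence!';

SELECT s.[value] AS item
FROM (VALUES (
        -- Turn selected punctuation and whitespace characters into spaces.
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@txt,
            '.', ' '), '!', ' '), CHAR(9), ' '), CHAR(10), ' '), CHAR(13), ' ')
     )) AS c(cleaned)
CROSS APPLY STRING_SPLIT(c.cleaned, ' ') AS s
WHERE s.[value] <> '';   -- stands in for the de-duplicate and trim steps
```

Filtering out empty tokens replaces the explicit collapse and trim steps, at the cost of losing each word's position in the original string.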
Functions used:
-- Number generator: N1 runs from @low toward @high in steps of @gap.
-- @row1 = 1 numbers the rows from 1; @row1 = 0 numbers them from 0.
CREATE FUNCTION dbo.rangeAB
(
  @low  bigint,
  @high bigint,
  @gap  bigint,
  @row1 bit
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
  SELECT 1
  FROM (VALUES
    (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
    (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
    (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
    (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
    (0),(0)) T(N) -- 90 values
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
GO
-- Returns every contiguous substring (token) of length @N, with its position.
-- Assumes the samd schema already exists.
CREATE FUNCTION samd.NGrams8K
(
  @string VARCHAR(8000), -- Input string
  @N      INT            -- requested token size
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  position = r.RN,
  token    = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO
-- Replaces each character of @string that matches @pattern with @replace.
CREATE FUNCTION samd.patReplace8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50),
  @replace VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
  SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
              WHEN @pattern+@replace+@string IS NOT NULL THEN
                CASE WHEN PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0
                     THEN ng.token ELSE @replace END END
  FROM samd.NGrams8K(@string, 1) AS ng
  ORDER BY ng.position
  FOR XML PATH(''),TYPE
).value('text()[1]', 'VARCHAR(8000)');
GO
-- Splits @string on @delimiter; returns each item with its ordinal, start index and length.
CREATE FUNCTION samd.delimitedSplitAB8K
(
  @string    VARCHAR(8000), -- input string
  @delimiter CHAR(1)        -- delimiter
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  itemNumber = ROW_NUMBER() OVER (ORDER BY d.p),
  itemIndex  = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),
  itemLength = CHECKSUM(item.ln),
  item       = SUBSTRING(@string, d.p+1, item.ln)
FROM (VALUES (DATALENGTH(@string))) AS l(s) -- length of the string
CROSS APPLY
(
  SELECT 0 UNION ALL -- for handling leading delimiters
  SELECT ng.position
  FROM samd.NGrams8K(@string, 1) AS ng
  WHERE token = @delimiter
) AS d(p) -- delimiter.position
CROSS APPLY (VALUES( --LEAD(d.p, 1, l.s+l.d) OVER (ORDER BY d.p) - (d.p+l.d)
  ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);
GO
-- Collapses runs of repeated @char in @string into a single @char.
CREATE FUNCTION dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString =
  replace(replace(replace(replace(replace(replace(replace(
    @string COLLATE LATIN1_GENERAL_BIN,
    replicate(@char,33), @char), --33
    replicate(@char,17), @char), --17
    replicate(@char,9 ), @char), -- 9
    replicate(@char,5 ), @char), -- 5
    replicate(@char,3 ), @char), -- 3
    replicate(@char,2 ), @char), -- 2
    replicate(@char,2 ), @char); -- 2
GO
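Once the functions above have been created, each stage can be tried in isolation, which may help when adapting the pattern or checking edge cases. The calls below are a small sketch: the literal inputs are made up for illustration, and it assumes the samd schema and all five functions exist as created above.

```sql
-- Character-level tokens: one row per character of the input.
SELECT ng.position, ng.token
FROM samd.NGrams8K('abc', 1) AS ng
ORDER BY ng.position;

-- Replace anything that is not a letter or a space with a space.
SELECT pr.newString
FROM samd.patReplace8K('one,two.three', '[^a-zA-Z ]', ' ') AS pr;

-- Collapse a run of spaces into a single space.
SELECT rd.NewString
FROM dbo.RemoveDupChar8K('one   two', ' ') AS rd;

-- Split on a single space, returning position and length for each item.
SELECT sp.itemNumber, sp.itemIndex, sp.itemLength, sp.item
FROM samd.delimitedSplitAB8K('one two three', ' ') AS sp
ORDER BY sp.itemNumber;
```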