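I built a query that finds the longest common substrings in a column and sorts them by frequency. The problem I'm having is removing or grouping similar results.
Here is the TOP 5 output from the code below - note that "I love Mittens the cat" is the longest, most frequent string, but the code also finds all of that string's substrings, such as "I love Mittens the ca" or "I love Mittens the c".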
I love Mittens the cat 3
I love Mittens the ca 3
love Mittens the cat 3
love Mittens the ca 3
I love Mittens the c 3
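If possible, I'd like to remove any substring that is similar to another one and contains a partial word. Row 3 would be fine, since it consists of whole words, but rows 4 and 5 should be removed because they are similar to row 1.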
DECLARE @MinLength INT = 5 --Minimum Substring Length
DECLARE @MaxLength INT = 50 --Maximum Substring Length
DECLARE @Delimeter VARCHAR(5) = ' '
DECLARE @T TABLE
(
    ID INT IDENTITY
    , chvStrings VARCHAR(64)
)
INSERT INTO @T VALUES
    ('I like cats'),
    ('I like dogs'),
    ('cats are great'),
    ('look at that cat'),
    ('I love Mittens the cat'),
    ('I love Mittens the cat a lot'),
    ('I love Mittens the cat so much'),
    ('Dogs are okay, I guess...')
SELECT TOP 10000
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
    COUNT(M.number) AS [Count]
FROM
    @T as T
    CROSS APPLY
        (SELECT N.Number
         FROM [master]..spt_values as N
         WHERE N.type = 'P'
         AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
    CROSS APPLY
        (SELECT N.Number
         FROM [master]..spt_values as N
         WHERE N.type = 'P'
         AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE
    N.number <= LEN(t.chvStrings) - M.number + 1
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1)
GROUP BY
    SUBSTRING(T.chvStrings, N.Number, M.Number)
ORDER BY
    COUNT(T.chvStrings) DESC,
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC
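Answer 0 (score: 1)
I added a couple of extra filters: the character just before the substring (N.Number - 1) cannot be a letter or digit [a-z0-9], and likewise the character just after the substring (M.Number + 1) cannot be [a-z0-9].
That should be what you need. Modified code: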
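The two boundary filters described in this answer could look roughly like the sketch below. It is a guess at how they might be written, not the answerer's exact code. Assumptions: `[master]..spt_values` is available as in the question, the column uses a case-insensitive collation (so `[a-z0-9]` also matches uppercase letters), the inner alias `V` and the `NOT LIKE ' %'` check are additions made here for illustration, and the question's "must start at the delimiter" check is replaced by the start-boundary filter.

```sql
DECLARE @MinLength INT = 5;   -- minimum substring length
DECLARE @MaxLength INT = 50;  -- maximum substring length

DECLARE @T TABLE (ID INT IDENTITY, chvStrings VARCHAR(64));
INSERT INTO @T VALUES
    ('I like cats'),
    ('I like dogs'),
    ('cats are great'),
    ('look at that cat'),
    ('I love Mittens the cat'),
    ('I love Mittens the cat a lot'),
    ('I love Mittens the cat so much'),
    ('Dogs are okay, I guess...');

SELECT TOP 10000
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
    COUNT(*) AS [Count]
FROM @T AS T
CROSS APPLY (SELECT V.number AS Number          -- N.Number = start position
             FROM [master]..spt_values AS V
             WHERE V.type = 'P'
               AND V.number BETWEEN 1 AND LEN(T.chvStrings)) AS N
CROSS APPLY (SELECT V.number AS Number          -- M.Number = substring length
             FROM [master]..spt_values AS V
             WHERE V.type = 'P'
               AND V.number BETWEEN @MinLength AND @MaxLength) AS M
WHERE N.Number <= LEN(T.chvStrings) - M.Number + 1
  -- no leading or trailing space, no underscore (first check is an addition)
  AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE ' %'
  AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
  AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
  -- start boundary: the character before the substring is not alphanumeric
  AND (N.Number = 1
       OR SUBSTRING(T.chvStrings, N.Number - 1, 1) NOT LIKE '[a-z0-9]')
  -- end boundary: the character after the substring is not alphanumeric
  -- (SUBSTRING returns '' past the end of the string, which passes the test)
  AND SUBSTRING(T.chvStrings, N.Number + M.Number, 1) NOT LIKE '[a-z0-9]'
GROUP BY SUBSTRING(T.chvStrings, N.Number, M.Number)
ORDER BY COUNT(*) DESC,
         LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC;
```

With both boundary checks in place, a candidate such as "I love Mittens the ca" is rejected because the next character in the source string is the letter "t", while whole-word substrings such as "love Mittens the cat" still pass.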