I have this parent-child relationship:
Paragraph
---------
ParagraphID PK
// other attributes ...
Sentence
--------
SentenceID PK
ParagraphID FK -> Paragraph.ParagraphID
Text nvarchar(4000)
Offset int
Score int
// other attributes ...
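I want to find identical paragraphs, that is, paragraphs that contain the same set of sentences. Two sentences are considered identical if they have the same Text, Offset, and Score; SentenceID and ParagraphID are not part of the comparison. Two paragraphs are equivalent if they contain the same set of sentences.
Can someone show me what a query to find identical paragraphs would look like?
Edit: there are about 150K paragraphs and 1.5M sentences. The output should include the ParagraphID and the lowest ParagraphID it is equivalent to. For example, if paragraph 1 and paragraph 2 are equivalent, the output would be:
ParagraphID EquivParagraphID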
1 1
2 1
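Answer 0 (score: 1)
In short, you need to build a signature for each paragraph and then compare the signatures. You did not mention the nature of the output itself. Here, I return one row per distinct paragraph signature, containing a comma-separated list of ParagraphId values.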
With ParagraphSigs As
    (
    -- One SHA1 signature per paragraph, built from its sentences in SentenceId order
    Select P.ParagraphId
        , Hashbytes('SHA1'
            ,   (
                Select '|' + S1.Text
                    + '|' + Cast(S1.Offset As varchar(10))
                    + '|' + Cast(S1.Score As varchar(10))
                From Sentence As S1
                Where S1.ParagraphId = P.ParagraphId
                Order By S1.SentenceId
                For Xml Path('')
                )) As Signature
    From Paragraph As P
    )
-- One row per signature, listing the paragraphs that share it
Select Stuff(
        (
        Select ', ' + Cast(PS1.ParagraphId As varchar(10))
        From ParagraphSigs As PS1
        Where PS1.Signature = PS.Signature
        For Xml Path('')
        ), 1, 2, '') As Paragraph
From ParagraphSigs As PS
Group By PS.Signature
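The signature above concatenates sentences in SentenceId order, so two paragraphs holding the same sentences inserted in a different order would hash differently. If "same set of sentences" is meant to ignore insertion order, one possible variation (a sketch only, not tested against real data) orders by the compared columns instead. The output column names ParagraphCount and LowestParagraphId are made up for illustration.

```sql
With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
            ,   (
                Select '|' + S1.Text
                    + '|' + Cast(S1.Offset As varchar(10))
                    + '|' + Cast(S1.Score As varchar(10))
                From Sentence As S1
                Where S1.ParagraphId = P.ParagraphId
                -- Assumption: order by the compared columns, not SentenceId,
                -- so the signature does not depend on insertion order
                Order By S1.Offset, S1.Score, S1.Text
                For Xml Path('')
                )) As Signature
    From Paragraph As P
    )
-- Keep only the signatures shared by more than one paragraph
Select PS.Signature
    , Count(*) As ParagraphCount
    , Min(PS.ParagraphId) As LowestParagraphId
From ParagraphSigs As PS
Group By PS.Signature
Having Count(*) > 1
```

One caveat to keep in mind: on SQL Server 2008, Hashbytes accepts at most 8000 bytes of input, so paragraphs with a lot of text may need a different signature scheme.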
Given what you added about the desired output, the signature query can be changed like so:
With ParagraphSigs As
    (
    -- One SHA1 signature per paragraph, built from its sentences in SentenceId order
    Select P.ParagraphId
        , Hashbytes('SHA1'
            ,   (
                Select '|' + S1.Text
                    + '|' + Cast(S1.Offset As varchar(10))
                    + '|' + Cast(S1.Score As varchar(10))
                From Sentence As S1
                Where S1.ParagraphId = P.ParagraphId
                Order By S1.SentenceId
                For Xml Path('')
                )) As Signature
    From Paragraph As P
    )
-- Pair every paragraph with each other paragraph that has the same signature
Select P1.ParagraphId, P2.ParagraphId As EquivParagraphId
From ParagraphSigs As P1
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId
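Obviously, three or four paragraphs might share the same signature, so be aware that the above results will give you the Cartesian product of the matching paragraphs (e.g. (P1,P2), (P1,P3), (P2,P1), (P2,P3), (P3,P1), (P3,P2)).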
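To get the shape asked for in the question's edit (each ParagraphID next to the lowest ParagraphID it is equivalent to, rather than every pair), one option is a windowed Min over the signature. This is a minimal sketch that reuses the same ParagraphSigs CTE; it has not been run against the real tables.

```sql
With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
            ,   (
                Select '|' + S1.Text
                    + '|' + Cast(S1.Offset As varchar(10))
                    + '|' + Cast(S1.Score As varchar(10))
                From Sentence As S1
                Where S1.ParagraphId = P.ParagraphId
                Order By S1.SentenceId
                For Xml Path('')
                )) As Signature
    From Paragraph As P
    )
-- Every paragraph appears exactly once; a paragraph with no twin maps to itself.
-- Note: paragraphs without sentences all get a NULL signature and therefore
-- land in the same partition, i.e. they are reported as equivalent to each other.
Select PS.ParagraphId
    , Min(PS.ParagraphId) Over (Partition By PS.Signature) As EquivParagraphId
From ParagraphSigs As PS
Order By PS.ParagraphId
```

With paragraphs 1 and 2 identical, this should produce the rows (1, 1) and (2, 1) from the question's example, assuming no hash collisions.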
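In the comments, you asked about searching the sentences efficiently. Since you have two other parameters, you can reduce the number of signatures that have to be generated by first comparing the two int columns: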
With ParagraphsNeedingSigs As
    (
    -- Offset and Score are columns of Sentence in the schema above, so the
    -- pre-filter compares sentences: a paragraph only needs a signature if some
    -- other paragraph has a sentence with the same Offset and Score.
    Select Distinct S1.ParagraphId
    From Sentence As S1
    Where Exists    (
                    Select 1
                    From Sentence As S2
                    Where S2.ParagraphId <> S1.ParagraphId
                        And S2.Offset = S1.Offset
                        And S2.Score = S1.Score
                    )
    )
    , ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
            ,   (
                Select '|' + S1.Text
                    + '|' + Cast(S1.Offset As varchar(10))
                    + '|' + Cast(S1.Score As varchar(10))
                From Sentence As S1
                Where S1.ParagraphId = P.ParagraphId
                Order By S1.SentenceId
                For Xml Path('')
                )) As Signature
    From ParagraphsNeedingSigs As P
    )
-- Paragraphs that were filtered out still appear, with a NULL EquivParagraphId
Select P.ParagraphId, P2.ParagraphId As EquivParagraphId
From Paragraph As P
    Left Join ParagraphSigs As P1
        On P1.ParagraphId = P.ParagraphId
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId
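Another way to apply the same "compare the cheap int columns first" idea is to compare per-paragraph aggregates, since identical paragraphs must have the same sentence count and the same Offset and Score totals. The sketch below only lists candidate paragraphs; it is not a full solution, and the names ParagraphStats, SentenceCount, OffsetSum, and ScoreSum are invented for illustration. It also assumes a paragraph does not contain the same sentence twice, otherwise the counts of two "same set" paragraphs could differ.

```sql
With ParagraphStats As
    (
    -- Cheap fingerprint of each paragraph, using only the int columns
    Select S.ParagraphId
        , Count(*) As SentenceCount
        , Sum(Cast(S.Offset As bigint)) As OffsetSum   -- bigint avoids int overflow
        , Sum(Cast(S.Score As bigint)) As ScoreSum
    From Sentence As S
    Group By S.ParagraphId
    )
-- Candidates: paragraphs whose fingerprint is shared by at least one other paragraph
Select PS1.ParagraphId
From ParagraphStats As PS1
Where Exists    (
                Select 1
                From ParagraphStats As PS2
                Where PS2.ParagraphId <> PS1.ParagraphId
                    And PS2.SentenceCount = PS1.SentenceCount
                    And PS2.OffsetSum = PS1.OffsetSum
                    And PS2.ScoreSum = PS1.ScoreSum
                )
```

Matching aggregates do not prove that two paragraphs are identical, so the signatures (or a direct comparison of the sentences) would still be needed for the candidates this returns.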
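Answer 1 (score: 1)
Since you have listed SQL 2008 (I'm not sure whether this syntax is available in 2005), you can use EXCEPT or INTERSECT to do the comparison. It involves correlated subqueries, so performance may be an issue.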
SELECT
    *
FROM
    Paragraph P
WHERE
    (SELECT COUNT(*) FROM
        (
        SELECT
            S1.[Text],
            S1.Offset,
            S1.Score
        FROM
            Paragraph P1
            INNER JOIN Sentence S1 ON
                S1.ParagraphID = P1.ParagraphID
        WHERE
            P1.ParagraphID = P.ParagraphID
        INTERSECT
        SELECT
            S2.[Text],
            S2.Offset,
            S2.Score
        FROM
            Paragraph P2
            INNER JOIN Sentence S2 ON
                S2.ParagraphID = P2.ParagraphID
        WHERE
            P2.ParagraphID > P.ParagraphID
        ) SQ
    ) = (SELECT COUNT(*) FROM Sentence P3 WHERE P3.ParagraphID = P.ParagraphID)
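As far as I can tell, the INTERSECT query compares a paragraph against the sentences of all higher-numbered paragraphs taken together, so it can also return a paragraph whose sentences are spread over several other paragraphs. A pairwise comparison without hashing could be sketched as follows. This is an untested sketch, it assumes no paragraph contains the same (Text, Offset, Score) combination twice, and the names SentenceCounts, MatchingPairs, OtherParagraphID, and MatchCount are invented for illustration.

```sql
WITH SentenceCounts AS
    (
    -- Number of sentences in each paragraph
    SELECT S.ParagraphID, COUNT(*) AS SentenceCount
    FROM Sentence S
    GROUP BY S.ParagraphID
    )
, MatchingPairs AS
    (
    -- For every pair of paragraphs, count the sentences they have in common.
    -- A paragraph is also paired with itself, which keeps unmatched
    -- paragraphs in the output.
    SELECT
        S1.ParagraphID AS ParagraphID,
        S2.ParagraphID AS OtherParagraphID,
        COUNT(*) AS MatchCount
    FROM
        Sentence S1
        INNER JOIN Sentence S2 ON
            S2.[Text] = S1.[Text]
            AND S2.Offset = S1.Offset
            AND S2.Score = S1.Score
    GROUP BY
        S1.ParagraphID,
        S2.ParagraphID
    )
-- Two paragraphs are equivalent when the number of shared sentences equals
-- the sentence count of both; report the lowest equivalent ParagraphID.
SELECT
    MP.ParagraphID,
    MIN(MP.OtherParagraphID) AS EquivParagraphID
FROM
    MatchingPairs MP
    INNER JOIN SentenceCounts C1 ON
        C1.ParagraphID = MP.ParagraphID
    INNER JOIN SentenceCounts C2 ON
        C2.ParagraphID = MP.OtherParagraphID
WHERE
    MP.MatchCount = C1.SentenceCount
    AND MP.MatchCount = C2.SentenceCount
GROUP BY
    MP.ParagraphID
ORDER BY
    MP.ParagraphID
```

Paragraphs without any sentences do not appear in this result, and the self-join on the Text column may be slow with 1.5M sentences, so it is worth checking the execution plan before relying on it.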