Grouping parents that contain the same set of children

Time: 2010-07-21 15:05:25

Tags: sql sql-server-2005 sql-server-2008

I have this parent-child relationship:

Paragraph
---------
ParagraphID   PK
// other attributes ...


Sentence
--------
SentenceID    PK
ParagraphID   FK -> Paragraph.ParagraphID
Text         nvarchar(4000)
Offset       int
Score        int
// other attributes ...

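I want to find identical paragraphs, that is, paragraphs that contain the same set of sentences. Two sentences are considered identical if they have the same Text, Offset and Score; SentenceID / ParagraphID are not part of the comparison. Two paragraphs are equivalent if they contain the same set of sentences.

Can someone show me what a query that finds identical paragraphs would look like?

Edit: there are about 150K paragraphs and 1.5M sentences. The output should include the ParagraphID and the lowest ParagraphID that is equivalent to it. For example, if paragraph1 and paragraph2 are equivalent, the output would be: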

ParagraphID  EquivParagraphID
1            1
2            1

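2 answers:

Answer 0: (score: 1)

In short, you need to compute a signature for each paragraph and then compare the signatures. You did not mention the nature of the output itself. Here, I return one row per distinct paragraph signature, containing a comma-separated list of the ParagraphId values that share it.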

With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        + '|' + Cast(S1.Offset As varchar(10)) 
                        + '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From Paragraph As P
    )
Select Stuff(
            (
            Select ', ' + Cast(PS1.ParagraphId As varchar(10))
            From ParagraphSigs As PS1
            Where PS1.Signature = PS.Signature
            For Xml Path('')
            ), 1, 2, '') As Paragraph
From ParagraphSigs As PS
Group By PS.Signature

Given what you added about the desired output, you can change the query like this:

With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        + '|' + Cast(S1.Offset As varchar(10)) 
                        + '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From Paragraph As P
    )
Select P1.ParagraphId, P2.ParagraphId As EquivParagraphId
From ParagraphSigs As P1
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId

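Obviously, three or four paragraphs may share the same signature, so be aware that the query above returns every ordered pair of matching paragraphs, e.g. (P1,P2), (P1,P3), (P2,P1), (P2,P3), (P3,P1), (P3,P2).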
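To get the output the question asks for (each ParagraphID next to the lowest equivalent ParagraphID), one option is to take the minimum ParagraphID within each signature group using a windowed aggregate instead of a self-join. The following is a minimal sketch, not the answer author's query. It assumes the schema from the question, orders the sentences by Offset, Score and Text rather than SentenceID so that the signature does not depend on insertion order, and assumes each paragraph's concatenated string fits within the 8000-byte input limit that HASHBYTES has on SQL Server 2005 and 2008.

```sql
-- Sketch only: one row per paragraph plus the lowest ParagraphID sharing its signature.
With ParagraphSigs As
    (
    Select P.ParagraphID
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.[Text]
                        + '|' + Cast(S1.[Offset] As varchar(10))
                        + '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphID = P.ParagraphID
                    -- Assumption: order by content, not by SentenceID,
                    -- so equal sets always serialize identically.
                    Order By S1.[Offset], S1.Score, S1.[Text]
                    For Xml Path('')
                    )) As Signature
    From Paragraph As P
    )
Select PS.ParagraphID
    -- Lowest ParagraphID among all paragraphs with the same signature.
    , Min(PS.ParagraphID) Over (Partition By PS.Signature) As EquivParagraphID
From ParagraphSigs As PS
Order By PS.ParagraphID;
```

A paragraph with no sentences gets a NULL signature. Partition By places all NULLs in one group, so such paragraphs would be reported as equivalent to each other, which may or may not be what you want.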

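In the comments you asked how to make the search over the sentences efficient. Since you have two other columns besides the text, you can reduce the number of signatures that have to be generated by first comparing the two int columns: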

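With ParagraphsNeedingSigs As
    (
    -- Offset and Score are Sentence columns, so the pre-filter keeps only
    -- paragraphs that share an (Offset, Score) pair with another paragraph.
    Select P1.ParagraphId
    From Paragraph As P1
    Where Exists    (
                    Select 1
                    From Sentence As S1
                        Join Sentence As S2
                            On S2.Offset = S1.Offset
                                And S2.Score = S1.Score
                    Where S1.ParagraphId = P1.ParagraphId
                        And S2.ParagraphId <> P1.ParagraphId
                    )
    )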
    , ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        + '|' + Cast(S1.Offset As varchar(10)) 
                        + '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From ParagraphsNeedingSigs As P
    )
Select P.ParagraphId, P2.ParagraphId As EquivParagraphId
From Paragraph As P
    Left Join ParagraphSigs As P1
        On P1.ParagraphId = P.ParagraphId
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId
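Because SQL Server does not materialize a CTE, a query that references ParagraphSigs twice may compute the hashes more than once. With 150K paragraphs it could be worth storing the signatures once and indexing them. The sketch below shows one way to do that, assuming the schema from the question. The temp table name #ParagraphSigs and the index name IX_ParagraphSigs_Signature are made up for illustration, and the 8000-byte HASHBYTES input limit on SQL Server 2005 and 2008 still applies.

```sql
-- Sketch only: compute each signature once, then join on the stored value.
Select P.ParagraphID
    , Cast(Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.[Text]
                        + '|' + Cast(S1.[Offset] As varchar(10))
                        + '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphID = P.ParagraphID
                    Order By S1.SentenceID
                    For Xml Path('')
                    )) As binary(20)) As Signature   -- a SHA1 digest is 20 bytes
Into #ParagraphSigs
From Paragraph As P;

-- A fixed-width binary(20) key keeps the index small.
Create Index IX_ParagraphSigs_Signature
    On #ParagraphSigs (Signature, ParagraphID);

Select P1.ParagraphID, P2.ParagraphID As EquivParagraphID
From #ParagraphSigs As P1
    Left Join #ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphID <> P1.ParagraphID;

Drop Table #ParagraphSigs;
```

Whether this beats the CTE form depends on the data and the plan the optimizer picks, so it is worth comparing both on a realistic sample.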

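Answer 1: (score: 1)

Since you listed SQL 2008 (I'm not sure whether this syntax is available in 2005), you can use an EXCEPT or INTERSECT comparison. It involves a correlated subquery, so performance may be an issue.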

SELECT
    *
FROM
    Paragraph P
WHERE
    (SELECT COUNT(*) FROM 
(
    SELECT
        S1.[Text],
        S1.Offset,
        S1.Score
    FROM
        Paragraph P1
    INNER JOIN Sentence S1 ON
        S1.ParagraphID = P1.ParagraphID
    WHERE
        P1.ParagraphID = P.ParagraphID
    INTERSECT
    SELECT
        S2.[Text],
        S2.Offset,
        S2.Score
    FROM
        Paragraph P2
    INNER JOIN Sentence S2 ON
        S2.ParagraphID = P2.ParagraphID
    WHERE
        P2.ParagraphID > P.ParagraphID
) SQ
) = (SELECT COUNT(*) FROM Sentence P3 WHERE P3.ParagraphID = P.ParagraphID)
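Note that the right-hand side of the INTERSECT above pools the sentences of every paragraph with a higher ParagraphID, so it does not compare one paragraph against one specific other paragraph. A stricter pairwise check can be built on the fact that two sets are equal exactly when each EXCEPT the other is empty. The following is a minimal sketch under the question's schema, not the answer author's query. EXCEPT is documented as available from SQL Server 2005 onward. The join considers every pair of paragraphs, so at 150K paragraphs it is likely far too slow without a pre-filter.

```sql
-- Sketch only: pairwise set comparison, quadratic in the number of paragraphs.
Select P1.ParagraphID
    , Min(P2.ParagraphID) As EquivParagraphID
From Paragraph As P1
    Join Paragraph As P2
        On P2.ParagraphID <= P1.ParagraphID   -- includes P1 itself
Where Not Exists    (
                    -- sentences of P1 that are missing from P2
                    Select SA.[Text], SA.[Offset], SA.Score
                    From Sentence As SA
                    Where SA.ParagraphID = P1.ParagraphID
                    Except
                    Select SB.[Text], SB.[Offset], SB.Score
                    From Sentence As SB
                    Where SB.ParagraphID = P2.ParagraphID
                    )
    And Not Exists  (
                    -- sentences of P2 that are missing from P1
                    Select SB.[Text], SB.[Offset], SB.Score
                    From Sentence As SB
                    Where SB.ParagraphID = P2.ParagraphID
                    Except
                    Select SA.[Text], SA.[Offset], SA.Score
                    From Sentence As SA
                    Where SA.ParagraphID = P1.ParagraphID
                    )
Group By P1.ParagraphID
Order By P1.ParagraphID;
```

Because every paragraph equals itself, each ParagraphID appears once, paired with itself when nothing lower matches. EXCEPT compares distinct rows, so a paragraph that repeats a sentence is treated as equal to one that contains it only once.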