SQL Server - Find duplicates in XML values

Time: 2014-08-06 19:53:56

Tags: sql-server xml

I have a Movie table with two columns: ID (int) and MetaData (XML). MetaData looks like this:

<movie xmlns="urn:schemas-xxx:yyy:catalog" >
  <credits>
    <credit creditId="15594954" creditType="Actor" >aaa</credit>
    <credit creditId="15573106" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >ddd</credit>
    <credit creditId="15606109" creditType="Director" >ddd</credit>
    <credit creditId="16316911" creditType="Art Director" >adadad</credit>
    <credit creditId="18484117" creditType="Choreographer" >ch</credit>
    <credit creditId="15707268" creditType="Cinematographer" >cm</credit>
    <credit creditId="15907445" creditType="Screenwriter">sss</credit>
    <credit creditId="15905546" creditType="Screenwriter" >ggg</credit>
    <credit creditId="16493602" creditType="Editor" >eee</credit>
    <credit creditId="15825749" creditType="Composer" >ccc</credit>
    <credit creditId="18486706" creditType="Composer" >ddd</credit>
  </credits>
</movie>

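I want to find the records that have duplicates within a single credit type. Here the actor "bbb" is a duplicate, but "ddd" is not, because each "ddd" appears under a different creditType.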

With a query like the one below, records where an actor is also a director are returned as well, and I don't want those.

-- Check for Duplicate Cast and Crew
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE Metadata.value('count(/movie/credits/credit)', 'int') <> Metadata.value('count(distinct-values(/movie/credits/credit))', 'int')

If I change my query as shown below, it works.

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE 
 (
    (Metadata.value('count(/movie/credits/credit[@creditType="Actor"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Actor"]))', 'int')
        )

    OR (Metadata.value('count(/movie/credits/credit[@creditType="Director"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Director"]))', 'int')
        )
     OR (Metadata.value('count(/movie/credits/credit[@creditType="Producer"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Producer"]))', 'int')
        )
)

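But there are many credit types, such as Composer, Editor, and so on, and I don't want to repeat this condition for every one of them. Is there an efficient way to do this?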

Update

I found that the previous query does a case-sensitive comparison. I need a case-insensitive one, so I changed it as follows:

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*) FROM
(
SELECT  ID
FROM Movie
CROSS APPLY
Movie.Metadata.nodes('/movie/credits/credit[@creditType="Actor"]') x(y)
GROUP BY ID
HAVING 
 COUNT(y.value('.', 'varchar(100)')) <> COUNT(Distinct y.value('.', 'varchar(100)'))
) AS temp;

But my original problem remains.

1 Answer:

Answer 0: (score: 1)

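You can use a FLWOR expression and compare the counts for each distinct value of @creditType. Return a dummy node for every credit type whose counts differ, and use exist() to check whether any such node was returned.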

with xmlnamespaces(default 'urn:schemas-xxx:yyy:catalog')
select count(*)
from Movie as M
where M.Metadata.exist('
  for $creditType in distinct-values(/movie/credits/credit/@creditType)
  where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text())) != count(/movie/credits/credit[@creditType = $creditType]/text())
  return <X/>') = 1;
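
If the comparison also has to be case-insensitive, as in the question's update, one possible alternative is to shred the credits with nodes() and group per movie and credit type in plain T-SQL. The following is a minimal sketch under stated assumptions, not a drop-in solution: the table variable @Movie, its two sample rows, the aliases, and the choice of the Latin1_General_CI_AS collation are all made up for illustration. The projection happens in a derived table because SQL Server does not allow XML data type methods directly in a GROUP BY clause.

```sql
-- Hypothetical stand-in for the Movie table (assumption, not from the question).
DECLARE @Movie TABLE (ID int PRIMARY KEY, Metadata xml);

INSERT INTO @Movie (ID, Metadata) VALUES
(1, N'<movie xmlns="urn:schemas-xxx:yyy:catalog">
  <credits>
    <credit creditId="1" creditType="Actor">bbb</credit>
    <credit creditId="2" creditType="Actor">BBB</credit>
    <credit creditId="3" creditType="Director">ddd</credit>
  </credits>
</movie>'),
(2, N'<movie xmlns="urn:schemas-xxx:yyy:catalog">
  <credits>
    <credit creditId="4" creditType="Actor">ddd</credit>
    <credit creditId="5" creditType="Director">ddd</credit>
  </credits>
</movie>');

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT COUNT(DISTINCT dup.ID) AS MoviesWithDuplicates
FROM (
    -- One row per (movie, credit type) group that contains a repeated name.
    SELECT c.ID
    FROM (
        -- Shred every credit element into a relational row.
        SELECT M.ID,
               x.y.value('@creditType', 'varchar(100)') AS CreditType,
               -- Explicit case-insensitive collation (an assumed choice).
               x.y.value('.', 'varchar(100)') COLLATE Latin1_General_CI_AS AS CreditName
        FROM @Movie AS M
        CROSS APPLY M.Metadata.nodes('/movie/credits/credit') AS x(y)
    ) AS c
    GROUP BY c.ID, c.CreditType
    HAVING COUNT(*) <> COUNT(DISTINCT c.CreditName)
) AS dup;
```

With these two sample rows only movie 1 should be counted: "bbb" and "BBB" collapse into one value under the case-insensitive collation, while movie 2 has "ddd" under two different credit types. COUNT(DISTINCT dup.ID) is used because a movie with duplicates in several credit types would otherwise be counted once per type.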
