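PostgreSQL: flagging and identifying duplicates

Time: 2015-07-08 18:55:34

Tags: postgresql duplicates

I am trying to find a way to flag duplicate cases, similar to this question.

However, instead of counting the occurrences of duplicate values, I want to flag each row with 0 or 1, for duplicate and unique cases respectively. This is very similar to the Identify Duplicate Cases feature in SPSS. For example, if I have a data set like: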

Name    State    Gender
John     TX        M
Katniss  DC        F
Noah     CA        M
Katniss  CA        F
John     SD        M
Ariel    FL        F     

If I want to flag the rows with duplicate names, the output would look like this:

Name    State    Gender   Dup
John     TX        M       1
Katniss  DC        F       1 
Noah     CA        M       1
Katniss  CA        F       0
John     SD        M       0
Ariel    FL        F       1

A bonus would be a query that also lets me control which case is kept as the unique one when duplicates are resolved.

1 answer:

Answer 0: (score: 1)

SELECT name, state, gender
    , NOT EXISTS (SELECT 1 FROM names nx
            WHERE nx.name = na.name
              AND nx.gender = na.gender
              AND nx.ctid < na.ctid) AS Is_not_a_dup
FROM names na
   ;

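Explanation: [NOT] EXISTS(...) yields a boolean value, which can be cast to an integer. The cast to integer needs an extra pair of parentheses around the whole expression, though: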

SELECT name, state, gender
        , (NOT EXISTS (SELECT 1 FROM names nx
                WHERE nx.name = na.name
                  AND nx.gender = na.gender
                  AND nx.ctid < na.ctid))::integer AS is_not_a_dup
FROM names na
       ;
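
The answer does not show the script that created the table. The following is a minimal sketch of a setup that should produce the status messages at the top of the results. The schema name `tmp` and the column types are assumptions, not taken from the original post.

```sql
-- Hypothetical setup script; the schema name "tmp" and the text column
-- types are assumptions made for illustration.
DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp;
SET search_path = tmp;

CREATE TABLE names
    ( name   text NOT NULL
    , state  text NOT NULL
    , gender text NOT NULL
    );

-- Rows are inserted in the same order as the sample data in the question,
-- so on a freshly loaded table the ctid order matches this order.
INSERT INTO names (name, state, gender) VALUES
      ('John',    'TX', 'M')
    , ('Katniss', 'DC', 'F')
    , ('Noah',    'CA', 'M')
    , ('Katniss', 'CA', 'F')
    , ('John',    'SD', 'M')
    , ('Ariel',   'FL', 'F')
    ;
```

Note that the two queries compare both name and gender. If only the name should decide what counts as a duplicate, as in the question, the `nx.gender = na.gender` condition can be dropped.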

Result:

DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 6
  name   | state | gender | nodup 
---------+-------+--------+-------
 John    | TX    | M      | t
 Katniss | DC    | F      | t
 Noah    | CA    | M      | t
 Katniss | CA    | F      | f
 John    | SD    | M      | f
 Ariel   | FL    | F      | t
(6 rows)

  name   | state | gender | nodup 
---------+-------+--------+-------
 John    | TX    | M      |     1
 Katniss | DC    | F      |     1
 Noah    | CA    | M      |     1
 Katniss | CA    | F      |     0
 John    | SD    | M      |     0
 Ariel   | FL    | F      |     1
(6 rows)
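
For the bonus part of the question (controlling which row counts as the unique case), one possible alternative is a window function with an explicit ordering. This is a sketch rather than part of the original answer: the choice of `ORDER BY state` and the alias `is_primary` are assumptions made for illustration. It can also be safer than relying on ctid, because ctid reflects a row's physical location and can change after an UPDATE or VACUUM FULL.

```sql
-- Sketch: flag the first row of each name group as 1 and the rest as 0.
-- The ORDER BY inside the window decides which row is "first"; ordering by
-- state is only an illustrative choice (any column or expression works).
SELECT name, state, gender
     , (row_number() OVER (PARTITION BY name ORDER BY state) = 1)::integer
           AS is_primary
FROM names
ORDER BY name, state
    ;
```

With this ordering, the row kept for each name is the one whose state sorts first, so the choice between duplicates no longer depends on insertion order. Adding more columns to the PARTITION BY list (for example gender) narrows what counts as a duplicate.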