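Delete duplicate rows but keep many-to-many relationships

Time: 2017-08-31 07:31:03

Tags: mysql sql duplicates

I am saving the results from my TensorFlow image classifier in a SQL database. I have 3 tables: images, categories, and a table that joins the two and stores a weight. Some images have no relationships, and some have many.

The problem is that the images table contains duplicate rows that need to be deleted. However, if a duplicated image has one or more relationships, I need to keep those many-to-many relationships.

Here is an example: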

Table name: my_images

+----+------------+-----------------+
| ID | image_path | image_filename  |
+----+------------+-----------------+
|  1 | Film 1     | Film 1 001.jpg  |
|  2 | Film 1     | Film 1 001.jpg  |
|  3 | Film 1     | Film 1 002.jpg  |
|  4 | Film 1     | Film 1 002.jpg  |
|  5 | Film 1     | Film 1 003.jpg  |
|  6 | Film 1     | Film 1 003.jpg  |
+----+------------+-----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
|       1 | cat        |
|       2 | dog        |
|       3 | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
|        2 |       1 | 0.58516 |
|        2 |       3 | 0.16721 |
|        3 |       2 | 0.21475 |
+----------+---------+---------+

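So in this example, the ideal result would be to delete rows 1 and 4, plus either row 5 or row 6, from my_images.

2 answers:

Answer 0 (score: 1)

Approach this step by step.

First, find the duplicated entries: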

SELECT 
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1
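
To see what this first step returns on the sample data from the question, here is a minimal sketch. It assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for MySQL, and it adds a COUNT(*) column and an ORDER BY that are not part of the answer's query, purely to make the output easier to read.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT)"
)
conn.executemany(
    "INSERT INTO my_images VALUES (?, ?, ?)",
    [
        (1, "Film 1", "Film 1 001.jpg"),
        (2, "Film 1", "Film 1 001.jpg"),
        (3, "Film 1", "Film 1 002.jpg"),
        (4, "Film 1", "Film 1 002.jpg"),
        (5, "Film 1", "Film 1 003.jpg"),
        (6, "Film 1", "Film 1 003.jpg"),
    ],
)

# Group by the columns that define "the same image" and keep only groups
# that contain more than one row. COUNT(*) and ORDER BY are added for display.
dup_groups = conn.execute(
    """
    SELECT image_path, image_filename, COUNT(*) AS copies
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
    ORDER BY image_filename
    """
).fetchall()

for group in dup_groups:
    print(group)
# ('Film 1', 'Film 1 001.jpg', 2)
# ('Film 1', 'Film 1 002.jpg', 2)
# ('Film 1', 'Film 1 003.jpg', 2)
```

Every file name in the sample appears twice, so all three groups are reported.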

Second, get all the rows that are duplicated:

SELECT mi.*
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename

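Finally, get the IDs to delete: for each duplicated group, the lowest ID that has no relationships. This yields at most one ID per group, so a group with more than two copies needs the statement run more than once.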

SELECT MIN(ID)
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0

Check that everything is correct. If it is, turn the query into a DELETE statement.

DELETE my_images.* FROM my_images
JOIN (
    SELECT MIN(ID) AS ID
    FROM my_images mi
    JOIN (
        SELECT
        image_path, image_filename
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
    LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
    WHERE mtr.image_id IS NULL
    GROUP BY mi.image_path, mi.image_filename
    HAVING COUNT(*) > 0
) sq USING(ID);

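Edit: To address the issue Logar mentioned, run this UPDATE statement before the DELETE statement. It repoints every relationship of a duplicated image to the lowest ID in its group.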

UPDATE my_term_relationships mtr 
JOIN (
    SELECT mi.ID, minID
    FROM my_images mi
    JOIN (
        SELECT 
        image_path, image_filename, MIN(ID) AS minID
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID 
SET mtr.image_id = sq.minID;
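
The "repoint the relationships first, then delete" idea can be tried end to end on the question's sample data. The following is only a sketch under stated assumptions, not the answer's exact statements: it runs on an in-memory SQLite database through Python's sqlite3 module, and because MySQL's UPDATE ... JOIN and DELETE ... JOIN forms are not available there, it swaps in a correlated subquery for the UPDATE and a simpler "keep the lowest ID per group" DELETE. The aliases `canon` and `dup` are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT)"
)
conn.execute(
    "CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL)"
)
conn.executemany(
    "INSERT INTO my_images VALUES (?, ?, ?)",
    [
        (1, "Film 1", "Film 1 001.jpg"),
        (2, "Film 1", "Film 1 001.jpg"),
        (3, "Film 1", "Film 1 002.jpg"),
        (4, "Film 1", "Film 1 002.jpg"),
        (5, "Film 1", "Film 1 003.jpg"),
        (6, "Film 1", "Film 1 003.jpg"),
    ],
)
conn.executemany(
    "INSERT INTO my_term_relationships VALUES (?, ?, ?)",
    [(2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475)],
)

# Step 1: repoint each relationship to the lowest ID among the images that
# share its image's path and file name (hypothetical aliases: canon, dup).
conn.execute(
    """
    UPDATE my_term_relationships
    SET image_id = (
        SELECT MIN(canon.ID)
        FROM my_images AS canon
        JOIN my_images AS dup
          ON canon.image_path = dup.image_path
         AND canon.image_filename = dup.image_filename
        WHERE dup.ID = my_term_relationships.image_id
    )
    """
)

# Step 2: every relationship now points at a group's lowest ID, so all other
# rows in each group can be dropped without losing a relationship.
conn.execute(
    """
    DELETE FROM my_images
    WHERE ID NOT IN (
        SELECT MIN(ID) FROM my_images GROUP BY image_path, image_filename
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]
relationships = conn.execute(
    "SELECT image_id, term_id, weight FROM my_term_relationships ORDER BY image_id, term_id"
).fetchall()

print(remaining_ids)  # [1, 3, 5]
print(relationships)  # relationships of image 2 now belong to image 1
```

This variant keeps rows 1, 3 and 5 rather than 2, 3 and 5, which is equivalent once the relationships have been moved. One caveat: if two copies of an image carry the same term, the UPDATE produces two rows with the same image_id and term_id, which would violate a unique key on that pair if the table has one.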

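Answer 1 (score: 0)

You need to query two sets of image IDs and use them for filtering, assuming that image_path and image_filename together should identify a unique image:

  1. All my_images IDs that are not referenced by my_term_relationships, even though the corresponding image_path + image_filename pair may be referenced through another row.
  2. One ID for each image_path + image_filename pair that is not referenced in my_term_relationships at all.

The query below deletes the IDs in the first set, except for those in the second set: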

    DELETE FROM my_images 
    WHERE 
       ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
       AND 
       ID NOT IN (SELECT id FROM (
        SELECT MIN(ID) as id
        FROM my_images 
        LEFT JOIN my_term_relationships ON ID = image_id
         GROUP BY image_path,image_filename
         HAVING COUNT(image_id) = 0
       ) as u_ids -- 2
    );
    

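Note that the subquery on my_images inside the DELETE's WHERE clause has to be wrapped in another subquery (a derived table); otherwise MySQL rejects the statement. See this thread for an explanation: Can't specify target table for update in FROM clause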

Example: sqlfiddle
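
As a quick local check of the DELETE above against the question's sample data, here is a minimal sketch. It assumes an in-memory SQLite database through Python's sqlite3 module instead of MySQL; the statement itself is copied from the answer, and only the surrounding setup and the `orphaned` check are additions.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_images (ID INTEGER PRIMARY KEY, image_path TEXT, image_filename TEXT)"
)
conn.execute(
    "CREATE TABLE my_term_relationships (image_id INTEGER, term_id INTEGER, weight REAL)"
)
conn.executemany(
    "INSERT INTO my_images VALUES (?, ?, ?)",
    [
        (1, "Film 1", "Film 1 001.jpg"),
        (2, "Film 1", "Film 1 001.jpg"),
        (3, "Film 1", "Film 1 002.jpg"),
        (4, "Film 1", "Film 1 002.jpg"),
        (5, "Film 1", "Film 1 003.jpg"),
        (6, "Film 1", "Film 1 003.jpg"),
    ],
)
conn.executemany(
    "INSERT INTO my_term_relationships VALUES (?, ?, ?)",
    [(2, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475)],
)

# Set 1: IDs that no relationship references (deletion candidates).
# Set 2: one ID to keep for each image that has no relationships at all.
conn.execute(
    """
    DELETE FROM my_images
    WHERE
       ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
       AND
       ID NOT IN (SELECT id FROM (
        SELECT MIN(ID) as id
        FROM my_images
        LEFT JOIN my_term_relationships ON ID = image_id
         GROUP BY image_path, image_filename
         HAVING COUNT(image_id) = 0
       ) as u_ids -- 2
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM my_images ORDER BY ID")]

# Count relationships whose image no longer exists; this should stay at zero.
orphaned = conn.execute(
    """
    SELECT COUNT(*)
    FROM my_term_relationships
    WHERE image_id NOT IN (SELECT ID FROM my_images)
    """
).fetchone()[0]

print(remaining_ids)  # [2, 3, 5]
print(orphaned)       # 0
```

Rows 1, 4 and 6 are removed, which matches the result the question asks for. Because every referenced ID is kept, an image whose copies are all referenced would keep all of those copies.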

An example UPDATE query that moves the relationships of duplicate rows in my_term_relationships onto the lowest ID of each image:

    UPDATE my_term_relationships
    SET image_id = (
      select min(my_images.ID) 
      from my_images 
      join my_images as ref_image on (my_images.image_path = ref_image.image_path and my_images.image_filename = ref_image.image_filename)
      where ref_image.ID = image_id
     );