How do I clean up my join_table and remove duplicate entries?

Time: 2013-03-12 22:44:36

Tags: ruby-on-rails ruby-on-rails-3 postgresql optimization has-and-belongs-to-many

I have 2 models, Question and Tag, with a HABTM association between them, and they share a join table questions_tags.

Feast your eyes on this bad boy:

1.9.3p392 :011 > Question.count
   (852.1ms)  SELECT COUNT(*) FROM "questions" 
 => 417 
1.9.3p392 :012 > Tag.count
   (197.8ms)  SELECT COUNT(*) FROM "tags" 
 => 601 
1.9.3p392 :013 > Question.connection.execute("select count(*) from questions_tags").first["count"].to_i
   (648978.7ms)  select count(*) from questions_tags
 => 39919778 

I assume the questions_tags join table contains a bunch of duplicate records; otherwise I have no idea why it would be this large.

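How do I clean up that join table so that it contains only unique rows? Or how can I check whether there are duplicate records in it at all?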
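The suspicion can be sanity-checked with arithmetic alone, using the three counts from the console session above. This is a minimal sketch, assuming every join row points at an existing question and tag (orphaned rows would fall outside this bound); the constant and method names are made up for illustration.

```ruby
# Counts copied from the console session in the question.
QUESTIONS = 417
TAGS      = 601
JOIN_ROWS = 39_919_778

# Each distinct join row is one (question_id, tag_id) pair, so the number of
# distinct rows can never exceed questions * tags.
def max_unique_pairs(questions, tags)
  questions * tags
end

ceiling = max_unique_pairs(QUESTIONS, TAGS)
surplus = JOIN_ROWS - ceiling

puts "at most #{ceiling} distinct (question_id, tag_id) pairs"
puts "at least #{surplus} rows must be duplicates"
```

Since 417 × 601 = 250,617, a table with 39,919,778 rows cannot consist of unique pairs only.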

Edit 1

I am using PostgreSQL, and this is the schema for the join_table questions_tags:

  create_table "questions_tags", :id => false, :force => true do |t|
    t.integer "question_id"
    t.integer "tag_id"
  end

  add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
  add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

2 answers:

Answer 0 (score: 2)

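I'm adding this as a new answer because it's very different from my last one. This one assumes you don't have an id column on the join table. It creates a new table, selects the unique rows into it, then drops the old table and renames the new one. This will be much faster than anything involving a subselect. Note that SELECT ... INTO copies only the data, so the two indexes from the question's schema have to be recreated on the new table afterwards.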

foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           1 |      2
           2 |      1
           2 |      2
           1 |      1
           1 |      1
(5 rows)

foo=# select distinct question_id, tag_id into questions_tags_tmp from questions_tags;
SELECT 4
foo=# select * from questions_tags_tmp;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)

foo=# drop table questions_tags;
DROP TABLE
foo=# alter table questions_tags_tmp rename to questions_tags;
ALTER TABLE
foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)

Answer 1 (score: 1)

Delete tag associations that reference a nonexistent tag:

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    tags
                 WHERE   tags.id = questions_tags.tag_id);

Delete tag associations that reference a nonexistent question:

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    questions
                 WHERE   questions.id = questions_tags.question_id);

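Delete duplicate tag associations (this assumes the join table has an id column, which the schema in the question does not):

-- Keep the row with the lowest id in each (question_id, tag_id) group
-- and delete every other row in that group.
DELETE  FROM questions_tags
USING   ( SELECT qt3.question_id, qt3.tag_id, MIN(qt3.id) AS id
          FROM   questions_tags qt3
          GROUP BY qt3.question_id, qt3.tag_id
        ) qt2
WHERE   questions_tags.question_id = qt2.question_id AND
        questions_tags.tag_id = qt2.tag_id AND
        questions_tags.id != qt2.id;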
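
The logic of that last statement (keep the lowest id per group, delete the rest) can be traced on a handful of rows before pointing it at a real table. The following is only an in-memory sketch in plain Ruby, not something to run against 39 million rows. It assumes the groups are (question_id, tag_id) pairs and that each row has an id; Row, split_duplicates and SAMPLE_ROWS are hypothetical names introduced for illustration.

```ruby
# Hypothetical in-memory stand-in for questions_tags rows with an id column.
Row = Struct.new(:id, :question_id, :tag_id)

# Returns [kept, deleted]: the first array holds the lowest-id row of every
# (question_id, tag_id) group, the second holds all remaining rows.
def split_duplicates(rows)
  # Counterpart of the GROUP BY subquery: MIN(id) per pair.
  keep_ids = rows.group_by { |r| [r.question_id, r.tag_id] }
                 .map { |_pair, group| group.map(&:id).min }
  # Counterpart of the DELETE: rows whose id is not the group minimum go away.
  rows.partition { |r| keep_ids.include?(r.id) }
end

SAMPLE_ROWS = [
  Row.new(1, 1, 2),
  Row.new(2, 2, 1),
  Row.new(3, 2, 2),
  Row.new(4, 1, 1),
  Row.new(5, 1, 1) # same pair as id 4, so this is the duplicate
]

kept, deleted = split_duplicates(SAMPLE_ROWS)
puts "kept ids:    #{kept.map(&:id).inspect}"
puts "deleted ids: #{deleted.map(&:id).inspect}"
```

With these five sample rows, only the row with id 5 ends up in the deleted set; one row per pair survives.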

Note:

Please test the SQL in your development environment before running it in production.