How to find duplicate records in PostgreSQL

Time: 2015-01-26 18:55:46

Tags: sql postgresql duplicates

I have a PostgreSQL database table called "user_links" which currently allows duplicates across the following fields:

year, user_id, sid, cid

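The only unique constraint at the moment is on the first field, called "id". I now want to add a constraint to make sure that year, user_id, sid and cid are unique together, but I cannot apply it because duplicate values that violate this constraint already exist.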

Is there a way to find all the duplicates?

5 answers:

Answer 0 (score: 229)

The basic idea is to use a nested query with a count aggregate:

select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1

You can adjust the where clause in the inner query to narrow the search.
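For the table in the question, the inner where clause would presumably compare all four columns. Here is a minimal sketch under that assumption; the sample rows and the temporary table are made up for illustration, and only the column names come from the question.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; only the column names are from the question.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2014, 1, 10, 100),
    (2014, 1, 10, 100),  -- same four values as the row above
    (2014, 2, 10, 100),
    (2015, 1, 10, 100);

-- Return every row whose (year, user_id, sid, cid) combination occurs more than once.
SELECT ou.*
FROM user_links ou
WHERE (SELECT count(*)
       FROM user_links inr
       WHERE inr.year    = ou.year
         AND inr.user_id = ou.user_id
         AND inr.sid     = ou.sid
         AND inr.cid     = ou.cid) > 1
ORDER BY ou.id;

-- Discard the demo table.
ROLLBACK;
```

Note that `=` never matches NULL, so rows with a NULL in any of the four columns are not reported by this query.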


Another good solution was mentioned in the comments (but not everyone reads them):

select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1

Or shorter:

SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
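As a sketch of how the GROUP BY form might look for the table in the question: the sample rows are invented, and the array_agg column is an addition of mine that just shows which ids share each combination.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; only the column names are from the question.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2014, 1, 10, 100),
    (2014, 1, 10, 100),  -- same four values as the row above
    (2014, 2, 10, 100);

-- One output row per duplicated combination, with the number of copies
-- and the ids of the rows that carry it.
SELECT year, user_id, sid, cid,
       count(*)                  AS copies,
       array_agg(id ORDER BY id) AS ids
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1;

-- Discard the demo table.
ROLLBACK;
```

Unlike the nested query, this returns one row per duplicated combination rather than every offending row, which is usually easier to review by eye.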

Answer 1 (score: 64)

From "Find duplicate rows with PostgreSQL", here is a clever solution:

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
  FROM tbl
) dups
where 
dups.Row > 1
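Since every row with a row number above 1 is a surplus copy, the same subquery could also drive a delete. The following is only a sketch on made-up data, assuming you want to keep the lowest id in each group; the alias rn is my choice, used instead of Row to stay clear of SQL keywords.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; only the column names are from the question.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2014, 1, 10, 100),
    (2014, 1, 10, 100),  -- surplus copy
    (2014, 1, 10, 100),  -- surplus copy
    (2015, 1, 10, 100);

-- Number the rows inside each (year, user_id, sid, cid) group by id
-- and delete everything after the first one.
DELETE FROM user_links
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               row_number() OVER (PARTITION BY year, user_id, sid, cid
                                  ORDER BY id) AS rn
        FROM user_links
    ) dups
    WHERE dups.rn > 1
);

-- Inspect what is left, then discard the demo table.
SELECT * FROM user_links ORDER BY id;
ROLLBACK;
```

Running the inner SELECT on its own first is a cheap way to check which rows would go before deleting anything.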

Answer 2 (score: 3)

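You can join the table to itself on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1) and use the array_agg function on the id field of the second table alias. Finally, for array_agg to work properly, group the results by the tn1.id field. This produces a result set that contains the id of each record and an array of all the ids that match the join conditions.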

select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id;

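Obviously, ids that appear in the duplicate_entries array for one id will also have their own rows in the result set. You will have to use this result set to decide which id should become the source of 'truth', the one record that should not be deleted. Perhaps you could do something like this: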

with dupe_set as (
select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists 
 (select de from unnest(ds.duplicate_entries) as de where de < ds.id)

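This selects the lowest-numbered ID in each group of duplicates (assuming the ID is an increasing integer PK). These are the IDs you would keep.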
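One possible shortening, offered as a sketch rather than a replacement: PostgreSQL can compare a value against every element of an array with `< ALL (...)`, which expresses the same "no smaller id in the array" test without unnest. The sample data and the output column names keep_id and ids_to_delete are my own.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; only the column names are from the question.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2014, 1, 10, 100),
    (2014, 1, 10, 100),  -- duplicate
    (2015, 1, 10, 100),
    (2014, 1, 10, 100);  -- duplicate

WITH dupe_set AS (
    -- For each row, collect the ids of the other rows with the same four values.
    SELECT tn1.id,
           array_agg(tn2.id ORDER BY tn2.id) AS duplicate_entries
    FROM user_links tn1
    JOIN user_links tn2
      ON  tn1.year    = tn2.year
      AND tn1.sid     = tn2.sid
      AND tn1.user_id = tn2.user_id
      AND tn1.cid     = tn2.cid
      AND tn1.id     <> tn2.id
    GROUP BY tn1.id
)
-- Keep the row whose id is smaller than all of its duplicates.
SELECT ds.id                AS keep_id,
       ds.duplicate_entries AS ids_to_delete
FROM dupe_set ds
WHERE ds.id < ALL (ds.duplicate_entries);

-- Discard the demo table.
ROLLBACK;
```

Each result row then pairs the id to keep with the ids that could be removed.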

Answer 3 (score: 0)

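To keep things simple, I assume that you want to apply the unique constraint only to the column year, and that the primary key is a column named id.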

To find the duplicate values, you should run:

SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);

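The SQL statement above gives you a table containing all the duplicated years in your table. To delete all the duplicates except the latest one (the row with the highest id), use the SQL statement below.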

DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B  -- B is a second alias of the same table
WHERE A.year=B.year AND A.id<B.id;
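Extended to the four columns from the question, the same self-join delete might look like the sketch below, followed by the constraint the asker wanted to add. The sample rows and the constraint name user_links_year_user_sid_cid_key are hypothetical.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; only the column names are from the question.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2014, 1, 10, 100),
    (2014, 1, 10, 100),  -- same four values; this later row is the one kept
    (2015, 1, 10, 100);

-- Delete every row for which a row with the same four values
-- and a higher id exists, so the highest id in each group survives.
DELETE FROM user_links a
USING user_links b
WHERE a.year    = b.year
  AND a.user_id = b.user_id
  AND a.sid     = b.sid
  AND a.cid     = b.cid
  AND a.id      < b.id;

-- With the duplicates gone, the unique constraint can be created.
ALTER TABLE user_links
    ADD CONSTRAINT user_links_year_user_sid_cid_key
    UNIQUE (year, user_id, sid, cid);

-- Discard the demo table.
ROLLBACK;
```

Doing the delete and the ALTER TABLE in one transaction means no new duplicate can slip in between the two steps.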

Answer 4 (score: 0)

In your case, because of the constraint, you need to delete the duplicated records.

  1. Find the duplicated rows
  2. Organize them by created_at date; in this case I keep the oldest
  3. Delete the records with USING to filter the right rows
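The three steps above could be sketched roughly as follows. This is not the answerer's original query: it assumes the table has a created_at timestamp column (the question does not list one), and the sample rows are invented.

```sql
BEGIN;

-- Hypothetical stand-in for the real table; created_at is an assumed column.
CREATE TEMP TABLE user_links (
    id         serial PRIMARY KEY,
    year       integer,
    user_id    integer,
    sid        integer,
    cid        integer,
    created_at timestamptz
);

INSERT INTO user_links (year, user_id, sid, cid, created_at) VALUES
    (2014, 1, 10, 100, '2014-03-01 09:00+00'),
    (2014, 1, 10, 100, '2014-01-15 12:00+00'),  -- oldest copy, the one to keep
    (2015, 1, 10, 100, '2015-01-01 00:00+00');

WITH ranked AS (
    -- Steps 1 and 2: group the rows by the four columns and order each
    -- group by created_at, so the oldest row gets rn = 1.
    SELECT id,
           row_number() OVER (PARTITION BY year, user_id, sid, cid
                              ORDER BY created_at, id) AS rn
    FROM user_links
)
-- Step 3: delete with USING, removing everything except the oldest row per group.
DELETE FROM user_links ul
USING ranked r
WHERE ul.id = r.id
  AND r.rn > 1;

-- Inspect what is left, then discard the demo table.
SELECT * FROM user_links ORDER BY id;
ROLLBACK;
```

Ordering by created_at first and id second gives a deterministic result even when two copies share the same timestamp.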