Question

I have a PostgreSQL database table called "user_links" which currently allows duplicates in the following fields:

year, user_id, sid, cid

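The only unique constraint at the moment is on the first field, called "id". I now want to add a constraint to make sure that the combination of year, user_id, sid, and cid is unique, but I cannot apply the constraint because duplicate values that violate it already exist.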

Is there a way to find all the duplicates?

Answer 1

The basic idea is to use a nested query with a count aggregate:

select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1

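You can adjust the where clause in the inner query to narrow the search.

There is another good solution mentioned in the comments (but not everyone reads them):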

select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1

Or shorter:

SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
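
Applied to the table in the question, the GROUP BY / HAVING form might look like the sketch below. The temporary table, its column types, and the sample rows are assumptions made only so the query can be run as-is; the question does not state the real schema beyond the column names.

```sql
-- Hypothetical stand-in for the real table. A temp table with this name
-- shadows any existing user_links for the current session only.
CREATE TEMP TABLE user_links (
    id      serial PRIMARY KEY,
    year    integer,
    user_id integer,
    sid     integer,
    cid     integer
);

-- Sample data (invented): two duplicated combinations and one unique row.
INSERT INTO user_links (year, user_id, sid, cid) VALUES
    (2020, 1, 10, 100),
    (2020, 1, 10, 100),
    (2020, 1, 10, 100),
    (2021, 2, 20, 200),
    (2021, 2, 20, 200),
    (2022, 3, 30, 300);

-- One output row per duplicated combination, with the number of copies.
SELECT year, user_id, sid, cid,
       count(*) AS copies
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1
ORDER BY copies DESC;
-- With the sample data: (2020, 1, 10, 100, 3) and (2021, 2, 20, 200, 2)
```

This reports each duplicated combination once; it does not list the individual ids, which the approaches in the later answers do.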

Answer 2

From "Find duplicate rows with PostgreSQL", here is a clever solution:

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
  FROM tbl
) dups
where 
dups.Row > 1
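
The same window function can also drive the cleanup. The following is a sketch, assuming the user_links table and column names from the question and assuming that the row with the lowest id in each group is the one to keep; neither choice comes from the linked article.

```sql
-- Run inside a transaction so the result can be inspected before it is final.
BEGIN;

-- Number the rows within each (year, user_id, sid, cid) group by ascending id
-- and delete everything except the first row of each group.
DELETE FROM user_links
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY year, user_id, sid, cid
                                  ORDER BY id) AS rn
        FROM user_links
    ) ranked
    WHERE rn > 1
);

-- Check the table, then issue COMMIT; to keep the change or ROLLBACK; to undo it.
```

Changing the ORDER BY inside the window (for example to id DESC) changes which row of each group survives.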

Answer 3

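You can join the table to itself on the fields that would be duplicated, and then anti-join on the id field. Select the id field from the first table alias (tn1), then use the array_agg function on the id field of the second table alias. Finally, for array_agg to work properly, group the results by tn1.id. This produces a result set that contains the id of a record and an array of all the ids that match the join conditions.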

select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id;

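Obviously, ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id should become the source of "truth", the one record that should not be deleted. Maybe you could do something like this: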

with dupe_set as (
select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists 
 (select de from unnest(ds.duplicate_entries) as de where de < ds.id)

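This selects the lowest-numbered id in each set of duplicates (assuming the id is an increasing integer primary key). These are the ids you would keep.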
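
As an alternative to the self-join (not the method described above), grouping on the four columns can produce the id to keep and the full list of ids in a single pass. This is a sketch, assuming the user_links table from the question; keep_id and all_ids are illustrative names.

```sql
-- One row per duplicated combination:
--   keep_id  = lowest id in the group (the candidate to keep)
--   all_ids  = every id in the group, in ascending order
SELECT year, user_id, sid, cid,
       min(id)                   AS keep_id,
       array_agg(id ORDER BY id) AS all_ids
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1;
```

One difference worth knowing: GROUP BY puts rows with NULL in a key column into the same group, whereas the self-join on `=` never matches them. A unique constraint treats NULLs as distinct by default, so such rows would not block the constraint either way.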

Answer 4

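To keep things simple, I assume that you want to apply a unique constraint only to the column year, and that the primary key is a column named id.

To find the duplicate values, run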

SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);

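The SQL statement above returns a table containing all the duplicated years in your table. To delete all the duplicates except the latest entry, use the following SQL statement.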

DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B
WHERE A.year=B.year AND A.id<B.id;
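
Once the duplicates are gone, the constraint from the question can be added. This is a sketch, assuming the user_links table and the four columns named in the question; the constraint name is made up for illustration.

```sql
-- Fails with an error if any duplicated (year, user_id, sid, cid) combination
-- is still present, so it also serves as a final check on the cleanup.
ALTER TABLE user_links
    ADD CONSTRAINT user_links_year_user_id_sid_cid_key
    UNIQUE (year, user_id, sid, cid);
```

Running the delete and the ALTER TABLE in the same transaction avoids new duplicates being inserted between the two steps.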

Answer 5

In your case, because of the constraint, you need to delete the duplicated records.

Find the duplicated rows
Organize them by created_at date; in this case I am keeping the oldest
Delete the records with USING to filter the right rows
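
The steps above could be sketched as follows. This is not the answerer's original code; it assumes the user_links table from the question and additionally assumes a created_at timestamp column, which the question does not mention.

```sql
-- Assumption: user_links has a created_at column (hypothetical).
-- Steps 1 and 2: rank the rows of each group, oldest first. Rows without a
-- created_at value sort last, and id breaks ties.
WITH ranked AS (
    SELECT id,
           row_number() OVER (PARTITION BY year, user_id, sid, cid
                              ORDER BY created_at NULLS LAST, id) AS rnk
    FROM user_links
)
-- Step 3: delete every row that is not the oldest of its group,
-- using USING to join the target table to the ranked set.
DELETE FROM user_links
USING ranked
WHERE user_links.id = ranked.id
  AND ranked.rnk > 1;
```

Groups with a single row get rnk = 1 only, so they are left untouched without a separate "find duplicates" step.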

How to find duplicate records in PostgreSQL
