I have a PostgreSQL database table called "user_links" which currently allows duplicates in the following fields:
year, user_id, sid, cid
The only unique constraint is currently the first field, called "id". I now want to add a constraint to make sure that year, user_id, sid and cid are all unique together, but I cannot apply the constraint because duplicate values that violate it already exist.
Is there a way to find all duplicates?
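For reference, the constraint being attempted would look something like the sketch below (the constraint name is made up for illustration):
ALTER TABLE user_links
  ADD CONSTRAINT user_links_year_user_id_sid_cid_key
  UNIQUE (year, user_id, sid, cid);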
Answer 0 (score: 229)
The basic idea is to use a nested query with a count aggregate:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can tune the WHERE clause in the inner query to narrow the search down.
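Adapted to the user_links table from the question (a sketch using the column names given there), the inner query matches on all four columns:
select * from user_links ou
where (select count(*) from user_links inr
       where inr.year = ou.year
         and inr.user_id = ou.user_id
         and inr.sid = ou.sid
         and inr.cid = ou.cid) > 1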
Another good solution, mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or, shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
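Applied to the question's table, the grouped form lists each duplicated (year, user_id, sid, cid) combination together with how many times it occurs (a sketch using the column names from the question):
select year, user_id, sid, cid, count(*)
from user_links
group by year, user_id, sid, cid
having count(*) > 1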
Answer 1 (score: 64)
来自" Find duplicate rows with PostgreSQL"这是一个聪明的解决方案:
select * from (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
FROM tbl
) dups
where
dups.Row > 1
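Since the question ultimately needs those duplicates removed before the constraint can be added, the same window function can drive a delete (a sketch, assuming id is the primary key and that column1/column2 stand in for the question's four columns):
DELETE FROM tbl
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id ASC) AS rn
        FROM tbl
    ) dups
    WHERE dups.rn > 1
);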
Answer 2 (score: 3)
You can join the table to itself on the fields that would be duplicated, and then anti-join on the id field. Select the id field from the first table alias (tn1), then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by tn1.id. This produces a result set containing the id of each record together with an array of all the ids that meet the join conditions.
select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
     tn1.year = tn2.year
     and tn1.sid = tn2.sid
     and tn1.user_id = tn2.user_id
     and tn1.cid = tn2.cid
     and tn1.id <> tn2.id
group by tn1.id;
Obviously, ids that appear in another record's duplicate_entries array will also have entries of their own in the result set. You have to use this result set to decide which id should be the source of 'truth', i.e. the one record that shouldn't get deleted. Perhaps something like this:
with dupe_set as (
    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on
         tn1.year = tn2.year
         and tn1.sid = tn2.sid
         and tn1.user_id = tn2.user_id
         and tn1.cid = tn2.cid
         and tn1.id <> tn2.id
    group by tn1.id
    order by tn1.id asc)
select ds.id from dupe_set ds where not exists
    (select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered id in each group of duplicates (assuming ids increase with the primary key). These are the ids you would keep.
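To actually clear the table for the constraint, the complement of that keep set could then be deleted. This is a sketch built on the same CTE, not part of the original answer; it removes every row that has a duplicate with a smaller id:
with dupe_set as (
    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on
         tn1.year = tn2.year
         and tn1.sid = tn2.sid
         and tn1.user_id = tn2.user_id
         and tn1.cid = tn2.cid
         and tn1.id <> tn2.id
    group by tn1.id)
delete from table_name t
where exists
    (select 1 from dupe_set ds
     where ds.id = t.id
       -- delete only if some duplicate has a smaller id, keeping the minimum
       and exists (select 1 from unnest(ds.duplicate_entries) as de
                   where de < ds.id));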
Answer 3 (score: 0)
To keep things simple, let's assume you want to apply the unique constraint only to the column year, and that the primary key is a column called id.
In order to find the duplicate values, you should run:
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
The SQL statement above returns a table containing all the duplicate years in your table. In order to delete all the duplicates except the latest entry, use the following SQL statement:
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B
WHERE A.year=B.year AND A.id<B.id;
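Extended to the full column set from the question (a sketch; it keeps the row with the highest id in each duplicate group):
DELETE
FROM user_links A USING user_links B
WHERE A.year = B.year
  AND A.user_id = B.user_id
  AND A.sid = B.sid
  AND A.cid = B.cid
  AND A.id < B.id;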
Answer 4 (score: 0)
In your case, because of the constraint, you need to delete the duplicated records:
Organize them by created_at date - in this case keeping the oldest - and delete the records, using USING to filter the right rows.
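A sketch of such a delete (assuming user_links has a created_at timestamp column; this keeps the oldest row in each duplicate group):
DELETE FROM user_links u1
USING user_links u2
WHERE u1.created_at > u2.created_at  -- u1 is newer, so u1 gets deleted
  AND u1.year = u2.year
  AND u1.user_id = u2.user_id
  AND u1.sid = u2.sid
  AND u1.cid = u2.cid;
-- if created_at values can tie, an extra tiebreaker (e.g. comparing ids) is needed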