Question

I am using the following Python script to remove duplicates from a CSV file:

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

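I am trying to modify it so that, instead of writing the de-duplicated list to final.csv, it writes only the unique values it finds.

Basically the opposite of what it does now. Does anyone have an example?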

Answer 1

Use a dict to track how many times each line occurs. You can then walk the dict, add only the lines that occur once to the seen set, and write them to final.csv:

from collections import defaultdict
uniques = defaultdict(int)
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        uniques[line] +=1
    for k, v in uniques.items():  # on Python 2, use iteritems() instead
        if v == 1:  # keep only lines that occurred exactly once
            seen.add(k)
            out_file.write(k)

Or:

from collections import defaultdict
uniques = defaultdict(int)
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        uniques[line] +=1

    seen = set(k for k in uniques if uniques[k] == 1)
    for itm in seen:
        out_file.write(itm)

Or, using Counter:

from collections import Counter

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    lines = Counter(in_file.readlines())  # count occurrences of each line
    seen = set(k for k in lines if lines[k] == 1)
    for itm in seen:
        out_file.write(itm)

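This outputs only the lines that occur exactly once. Depending on what you mean by "uniques", that may or may not be what you want. If you instead want to output every line, but only one instance of each, use this last method: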

from collections import Counter

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    lines = Counter(in_file.readlines())  # keys are the distinct lines

    for itm in lines:
        out_file.write(itm)
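
Note that the variants above that write from a set do not preserve the order of the input file. If order matters, a minimal sketch along the following lines may help. The helper names lines_occurring_once and write_unique_lines are hypothetical, not part of the original answer, and the sketch assumes the whole file fits in memory:

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that occur exactly once, in their original order."""
    lines = list(lines)       # materialize so we can iterate twice
    counts = Counter(lines)   # line -> number of occurrences
    return [line for line in lines if counts[line] == 1]


def write_unique_lines(in_path, out_path):
    """Hypothetical wrapper: copy only the once-occurring lines to out_path."""
    with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
        out_file.writelines(lines_occurring_once(in_file))
```

Calling write_unique_lines('test.csv', 'final.csv') would then mirror the file names used in the question. Lines are compared as raw strings, so a final line without a trailing newline counts as different from the same text followed by a newline.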

Answer 2

You can collect the duplicates in a separate variable and use them to remove the non-unique values from the set.
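
One way to read this suggestion, as a rough sketch only: keep a second set of lines that have been seen more than once, then subtract it from the set of all lines seen. The function name unique_via_duplicates is an assumption made for illustration, and the answer itself does not specify the details:

```python
def unique_via_duplicates(lines):
    """Return the set of lines that appear exactly once in `lines`."""
    seen = set()        # every distinct line encountered so far
    duplicates = set()  # lines encountered more than once
    for line in lines:
        if line in seen:
            duplicates.add(line)
        else:
            seen.add(line)
    # Set difference drops every line that was ever repeated.
    return seen - duplicates
```

Passing an open file object such as in_file works because iterating over a file yields its lines. The result is a set, so the output order is not guaranteed to match the input.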
