I have a large file, poems.csv, with entries like the following:
"
this is a good poem.
",1
"
this is a bad poem.
",0
"
this is a good poem.
",1
"
this is a bad poem.
",0
I want to remove the duplicates from it.
If the file did not have the binary classifier, I could remove duplicate lines like this:
with open(data_in, 'r') as in_file, open(data_out, 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
But this would also remove all the classifications. How do I remove the duplicate entries while keeping the 0s and 1s?
Expected output:
"
this is a good poem.
",1
"
this is a bad poem.
",0
Answer 0 (score: 2)
Solved it with pandas:

import pandas as pd

raw_data = pd.read_csv(data_in)
clean_data = raw_data.drop_duplicates()
clean_data.to_csv(data_out, index=False)  # index=False avoids writing an extra index column
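Note that drop_duplicates with no arguments compares whole rows, so identical poems with different classifiers would both survive. If you want to deduplicate on the poem text alone and keep the first classifier seen, drop_duplicates takes a subset parameter. A minimal sketch, using a hypothetical in-memory sample in place of poems.csv and made-up column names (the real file has no header row):

```python
import io
import pandas as pd

# Hypothetical sample standing in for poems.csv; header=None because the file has no header row.
csv_text = '"\nthis is a good poem.\n",1\n"\nthis is a bad poem.\n",0\n"\nthis is a good poem.\n",1\n'
raw_data = pd.read_csv(io.StringIO(csv_text), header=None, names=["poem", "label"])

# Deduplicate on the poem text alone, keeping the classifier of the first occurrence.
clean_data = raw_data.drop_duplicates(subset="poem", keep="first")
print(len(clean_data))  # 2 unique poems remain
```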
答案 1 :(得分:1)
You can easily add both parts of the line to a set. Assuming your "line" consists of a string and an integer (or two strings), a tuple of the two elements is a valid set element: tuples are immutable, and therefore hashable, so they can be added to a set.
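As a quick illustration of why the conversion to tuple matters (with a made-up sample row):

```python
# A list is unhashable and cannot go into a set, but the equivalent tuple can.
seen = set()
row = ["this is a good poem.", "1"]
seen.add(tuple(row))
seen.add(tuple(["this is a good poem.", "1"]))  # duplicate, no effect on the set

print(len(seen))  # the set holds one unique (poem, classifier) pair
```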
Splitting the lines will be much easier with the csv.reader class, since among other things it lets you read a multi-line poem as a single row.
import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()  # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row)
        if row in seen: continue  # skip duplicate
        seen.add(row)
        writer.writerow(row)
Since your file almost certainly contains a bunch of multi-line values, using newline='' is crucial for both the input and the output, because it delegates line splitting to the csv classes.
The advantage of doing it this way, over pandas or another library that pre-loads the entire file, is that it avoids loading more than one copy of each duplicated poem into memory at a time. One copy of each poem is still kept in the set, but for very large files with many duplicates this solution is much closer to optimal.
We can test with the following file:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
The output looks like this:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
A note on Python 2
The newline parameter does not exist in the Python 2 version of open. This won't be a problem on most operating systems, since line endings will be internally consistent between the input and output files. Rather than specifying newline='', the Python 2 version of csv asks that files be opened in binary mode.
Update
You have indicated that the behavior of your own answer is not 100% correct. It seems that your data makes it a perfectly valid approach, though, so I have kept the previous portion of my answer.
To filter by the poem alone, ignoring the binary classifier (but keeping the one from the first occurrence), you only need a tiny change to the code:
import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()  # set for fast O(1) amortized lookup
    for row in reader:
        if row[0] in seen: continue  # skip duplicate
        seen.add(row[0])
        writer.writerow(row)
Since the zero classifier appears first in the file, the output for the test case above will be:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
I mentioned in the comments that you could also keep the last classifier seen, or always keep a 1 if one is found. For both options, I suggest using a dict (or collections.OrderedDict if you want to preserve the original order of the poems), keyed by the poem, with the classifier as the value. The keys of a dictionary are essentially a set. You would then write the output file at the end, after loading the entire input file.
Keeping the last classifier:
import csv
from collections import OrderedDict

seen = OrderedDict()  # map for fast O(1) amortized lookup

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    for poem, classifier in reader:
        seen[poem] = classifier  # always update to get the latest classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)
seen.items() iterates over tuples containing the key (the poem) and the value (the classifier), which happens to be exactly what you want to write to the file.
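As a small illustration (with hypothetical poem keys), assigning to an existing key overwrites the value while keeping the key's original position, which is what produces the "last classifier wins" behavior:

```python
from collections import OrderedDict

seen = OrderedDict()
seen["poem a"] = "0"
seen["poem a"] = "1"  # later classifier overwrites the earlier one, position unchanged
seen["poem b"] = "0"

print(list(seen.items()))  # [('poem a', '1'), ('poem b', '0')]
```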
The output of this version will have a 1 classifier, since that is what appears last in the test input above:
"Error 404:
Your Haiku could not be found.
Try again later.", 1
A similar approach can keep a 1 classifier whenever one is present:
import csv
from collections import OrderedDict

seen = OrderedDict()  # map for fast O(1) amortized lookup

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    for poem, classifier in reader:
        if poem not in seen or classifier == '1':
            seen[poem] = classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)