I have a large file, poems.csv, with entries like the following:
"
this is a good poem.
",1
"
this is a bad poem.
",0
"
this is a good poem.
",1
"
this is a bad poem.
",0
I want to remove the duplicates from it.
If the file did not have the binary classifier, I could remove duplicate lines like this:
with open(data_in, 'r') as in_file, open(data_out, 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
But this would also remove all the classifications. How do I remove the duplicate entries while keeping the 0s and 1s?
Expected output:
"
this is a good poem.
",1
"
this is a bad poem.
",0
Answer 0 (score: 2)
Solved it with pandas:

import pandas as pd

raw_data = pd.read_csv(data_in)
clean_data = raw_data.drop_duplicates()
clean_data.to_csv(data_out, index=False)  # index=False avoids writing an extra index column
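Note that drop_duplicates with no arguments compares whole rows, so identical poems with different classifiers would both survive. If you want to deduplicate on the poem text alone and keep the first classifier seen, drop_duplicates takes a subset parameter. A minimal sketch, using a hypothetical in-memory sample in place of poems.csv and made-up column names (the real file has no header row):

```python
import io
import pandas as pd

# Hypothetical sample standing in for poems.csv; header=None because the file has no header row.
csv_text = '"\nthis is a good poem.\n",1\n"\nthis is a bad poem.\n",0\n"\nthis is a good poem.\n",1\n'
raw_data = pd.read_csv(io.StringIO(csv_text), header=None, names=["poem", "label"])

# Deduplicate on the poem text alone, keeping the classifier of the first occurrence.
clean_data = raw_data.drop_duplicates(subset="poem", keep="first")
print(len(clean_data))  # 2 unique poems remain
```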
答案 1 :(得分:1)
You can easily add both parts of the line to a set. Assuming your "line" consists of a string and an integer (or two strings), a tuple of the two elements is a valid set element: tuples are immutable, and therefore hashable, so they can be added to a set.
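As a quick illustration of why the conversion to tuple matters (with a made-up sample row):

```python
# A list is unhashable and cannot go into a set, but the equivalent tuple can.
seen = set()
row = ["this is a good poem.", "1"]
seen.add(tuple(row))
seen.add(tuple(["this is a good poem.", "1"]))  # duplicate, no effect on the set

print(len(seen))  # the set holds one unique (poem, classifier) pair
```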
Splitting the lines will be much easier with the csv.reader class, since among other things it lets you read a multi-line poem as a single row.
import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()  # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row)
        if row in seen: continue  # skip duplicate
        seen.add(row)
        writer.writerow(row)
Since your file almost certainly contains a bunch of multi-line values, using newline='' is crucial for both the input and the output, because it delegates line splitting to the csv classes.
The advantage of doing it this way, over pandas or another library that pre-loads the entire file, is that it avoids loading more than one copy of each duplicated poem into memory at a time. One copy of each poem is still kept in the set, but for very large files with many duplicates this solution is much closer to optimal.
We can test with the following file:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
The output looks like this:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
A note on Python 2
The newline parameter does not exist in the Python 2 version of open. This won't be a problem on most operating systems, since line endings will be internally consistent between the input and output files. Rather than specifying newline='', the Python 2 version of csv asks that files be opened in binary mode.
Update
You have indicated that the behavior of your own answer is not 100% correct. It seems that your data makes it a perfectly valid approach, though, so I have kept the previous portion of my answer.
To filter by the poem alone, ignoring the binary classifier (but keeping the one from the first occurrence), you only need a tiny change to the code:
import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()  # set for fast O(1) amortized lookup
    for row in reader:
        if row[0] in seen: continue  # skip duplicate
        seen.add(row[0])
        writer.writerow(row)
Since the zero classifier appears first in the file, the output for the test case above will be:
"Error 404:
Your Haiku could not be found.
Try again later.", 0
I mentioned in the comments that you could also keep the last classifier seen, or always keep a 1 if one is found. For both options, I suggest using a dict (or collections.OrderedDict if you want to preserve the original order of the poems), keyed by the poem, with the classifier as the value. The keys of a dictionary are essentially a set. You would then write the output file at the end, after loading the entire input file.
Keeping the last classifier:
import csv
from collections import OrderedDict

seen = OrderedDict()  # map for fast O(1) amortized lookup

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    for poem, classifier in reader:
        seen[poem] = classifier  # always update to get the latest classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)
seen.items() iterates over tuples containing the key (the poem) and the value (the classifier), which happens to be exactly what you want to write to the file.
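As a small illustration (with hypothetical poem keys), assigning to an existing key overwrites the value while keeping the key's original position, which is what produces the "last classifier wins" behavior:

```python
from collections import OrderedDict

seen = OrderedDict()
seen["poem a"] = "0"
seen["poem a"] = "1"  # later classifier overwrites the earlier one, position unchanged
seen["poem b"] = "0"

print(list(seen.items()))  # [('poem a', '1'), ('poem b', '0')]
```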
The output of this version will have a 1 classifier, since that is what appears last in the test input above:
"Error 404:
Your Haiku could not be found.
Try again later.", 1
A similar approach can keep a 1 classifier whenever one is present:
import csv
from collections import OrderedDict

seen = OrderedDict()  # map for fast O(1) amortized lookup

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    for poem, classifier in reader:
        if poem not in seen or classifier == '1':
            seen[poem] = classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)