Goal
I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.
I want to get rid of the duplicates.
Approach
Write a Python script to remove the duplicates.
Technical specification
Windows XP SP 3
Python 2.7
CSV file with 400 contacts
Answer 0: (score: 48)
UPDATE: 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
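If you would rather not add the dependency, `unique_everseen` is essentially the recipe of the same name from the `itertools` documentation; a minimal sketch of the idea:

```python
def unique_everseen(iterable):
    """Yield elements in first-seen order, skipping anything seen before."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

# Works on any iterable of hashable items, including a file object (line by line):
lines = ["a,1\n", "b,2\n", "a,1\n", "c,3\n"]
deduped = list(unique_everseen(lines))
```

This is the same seen-set idea as the hand-written loops below, just packaged as a reusable generator.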
A more efficient version of @IcyFlame's solution:
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
To edit the same file in place, you can use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
Answer 1: (score: 15)
You can deduplicate efficiently using pandas:
import pandas as pd

file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")  # use sep="\t" for a tab-delimited file

# Notes:
# - `subset=None` means that every column is used
#   to determine if two rows are different; to change that, specify
#   the columns as an array
# - `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
# (index=False avoids writing the row index as an extra column)
df.to_csv(file_name_output, index=False)
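As a quick illustration of what `subset` changes (the column names here are made up for the example):

```python
import pandas as pd

# Two rows share the same 'email' but differ in 'name';
# with subset=['email'] they count as duplicates anyway.
df = pd.DataFrame({
    "name": ["Ann", "Ann B.", "Bob"],
    "email": ["ann@x.com", "ann@x.com", "bob@x.com"],
})

deduped_all = df.drop_duplicates()                    # full-row comparison: all 3 rows differ
deduped_email = df.drop_duplicates(subset=["email"])  # keeps the first 'Ann' row and 'Bob'
```

For the question's case (rows that are complete copies), the default `subset=None` is exactly what is wanted.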
Answer 2: (score: 6)
You can use the following script:
Prerequisites:
1.csv is the file containing the duplicates
2.csv is the output file, which will be free of duplicates once this script is executed.
Code
inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')
listLines = []

for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()
Algorithm explanation
Here is what I am doing:
1. open the input file in read mode and the output file in write mode
2. for every line, check whether it is already in listLines
3. if it is, skip it; otherwise write it to the output file and append it to listLines
Answer 3: (score: 2)
I know this was settled long ago, but I ran into a closely related problem where I needed to remove duplicates based on one column. The input CSV file was quite large, too big to open on my PC with MS Excel / LibreOffice Calc / Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below to do the job in under 5 minutes. I did not focus on optimization, but I believe it could be optimized to run faster and more efficiently on even bigger files. The algorithm is similar to @IcyFlame's above, except that I remove duplicates based on one column ('CCC') instead of the whole row/line.
import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this list will hold the unique ccc numbers
    ccc_numbers = []
    # read the input file into a dictionary; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # write the column headers to the output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )

    for result in results:
        ccc_number = result.get('CCC')
        # if the value already exists in the list, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])
        # add the value to the list so it is skipped subsequently
        ccc_numbers.append(ccc_number)
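As the answer itself notes, this is unoptimized: `ccc_number in ccc_numbers` is a linear scan over a list, so the whole run is O(n²). Swapping the list for a set makes each membership test O(1) amortized. A sketch of the same column-based dedup on an in-memory file (the column names follow the answer's CSV; the data is made up):

```python
import csv
import io

infile = io.StringIO(
    "ID,CCC,Result\n"
    "1,100,pos\n"
    "2,100,neg\n"  # duplicate CCC value, should be skipped
    "3,200,pos\n"
)
outfile = io.StringIO()

seen_ccc = set()  # set instead of list: O(1) amortized lookup
reader = csv.DictReader(infile)
writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    if row["CCC"] in seen_ccc:
        continue
    seen_ccc.add(row["CCC"])
    writer.writerow(row)
```

For 2.5 million records, this change alone should make the dominant cost the I/O rather than the lookups.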
Answer 4: (score: 1)
A more efficient version of @jamylak's solution (one instruction fewer):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)
To edit the same file in place, you can use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line,  # standard output is now redirected to the file
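The snippet above uses the Python 2 print statement, matching the question's Python 2.7 setup. Under Python 3, the same in-place edit looks like this (a sketch against `fileinput`'s documented API; the temp-file setup is only there to make the example self-contained):

```python
import fileinput
import os
import tempfile

# Create a throwaway CSV to demonstrate on (the path is just for this example).
path = os.path.join(tempfile.mkdtemp(), "contacts.csv")
with open(path, "w") as f:
    f.write("a,1\nb,2\na,1\n")

seen = set()  # set for fast O(1) amortized lookup
with fileinput.input(path, inplace=True) as stream:
    for line in stream:
        if line not in seen:
            seen.add(line)
            print(line, end="")  # stdout is redirected into the file

with open(path) as f:
    result = f.read()
```

Note that with `inplace=True`, anything printed inside the loop replaces the file's contents, so stray debug prints would end up in the CSV.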
Answer 5: (score: 0)
You can use the pandas library in a Jupyter notebook or a related IDE. Import pandas into the notebook and read the CSV file.
Then sort the values by the columns in which duplicates may occur; since I defined two attributes, it will sort first by time and then by latitude.
Then remove the duplicates in the time column, or whichever column is relevant to you.
Then I store the deduplicated, sorted file as gps_sorted.
import pandas as pd

stock = pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
stock2 = stock.sort_values(["time", "latitude"], ascending=True)
# drop_duplicates returns a new DataFrame; assign it back (or pass inplace=True),
# otherwise the deduplicated result is silently discarded
stock2 = stock2.drop_duplicates(subset=['time'])
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")
Hope this helps.
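One subtlety in this answer: sorting before `drop_duplicates` controls *which* duplicate survives, because `keep='first'` (the default) retains the first row encountered for each key. A small sketch with made-up GPS-like data:

```python
import pandas as pd

# Synthetic data: two points share the same 'time' stamp.
df = pd.DataFrame({
    "time": [3, 1, 1, 2],
    "latitude": [50.0, 10.0, 20.0, 30.0],
})

# Sort first, so within each 'time' the smallest latitude comes first...
df_sorted = df.sort_values(["time", "latitude"], ascending=True)

# ...then keep='first' (the default) retains that row for each 'time'.
deduped = df_sorted.drop_duplicates(subset=["time"])
```

Without the sort, whichever duplicate happened to appear first in the file would be kept instead.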