Data:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
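Here the duplicates are on the id. I want to remove the duplicates but keep the row with the longest row[1] (as shown in the ideal output).
Here is my attempt: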
import sys
import csv
import re
import string

dfo = open('fil.csv', newline='')
df = csv.reader(dfo, delimiter=',')
for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]:  # I am checking if the text is larger than the previous
            print(dup[0], dup[1])
    else:
        print(r[0], r[1])
But I get this output:
112343 The data point was created on 1903.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911.
112348 The data point was created on 1911.
Rows are missing!
The ideal output is:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
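How can I achieve this? What condition or keyword can I use? Or could I make two copies of the file and compare the rows between them to eliminate the duplicates?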
Answer 0 (score: 1)
My attempt:
import csv
import collections

csv_input = """112343, The data point was created on 1903.
112344, The data point was created on 1909.
112345, The data point was created on 1919.
112346, The data point was created on 1911.
112346, The data point was created on 1911-12.
112346, The data point was created on 1911-12.
112347, The data point was created on 1911.
112348, The data point was created on 1911."""

reader = csv.reader(csv_input.split('\n'))
result = collections.OrderedDict()  # remembers the order in which ids first appear
for row_id, data in reader:
    # keep the text only if it is longer than what is stored for this id
    if len(result.get(row_id, '')) < len(data):
        result[row_id] = data

for row_id, data in result.items():
    print("{},{}".format(row_id, data))
Answer 1 (score: 1)
Try this:
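import csv

some_dict = {}
file_name = "sample.csv"
with open(file_name, newline='') as f:
    data = csv.reader(f, delimiter=' ')
    for row in data:
        if not row:
            continue  # skip blank lines
        key = row.pop(0)
        # the text itself contains spaces, so rejoin the remaining fields
        text = ' '.join(row)
        # keep the text only if it is longer than what is stored for this id
        if key not in some_dict or len(text) > len(some_dict[key]):
            some_dict[key] = text

for key, value in some_dict.items():
    print(key, value)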
Answer 2 (score: 1)
My solution is:
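import csv

unqkey = set()
data = []
# raw string, so the backslash in the Windows path is taken literally
with open(r"C:\data.csv", newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        unqkey.add(row[0])
        data.append(row)

unqkey = sorted(list(unqkey))
for i in unqkey:
    r = []
    # collect every line that carries this id
    for j in data:
        if j[0] == i:
            r.append(' '.join(j))
    r.sort(key=len)  # shortest first, so the longest line ends up last
    print(r[-1])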
This prints:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
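The nested loop above rescans every row once per id. A minimal sketch of the same idea in a single pass, assuming Python 3 and rows shaped like what csv.reader yields; the function name `longest_per_id` and the sample rows are made up for illustration:

```python
from collections import defaultdict


def longest_per_id(rows):
    """Return the longest joined line for each id, ordered by id."""
    groups = defaultdict(list)
    for row in rows:
        if row:  # ignore empty rows such as a trailing blank line
            groups[row[0]].append(' '.join(row))
    # max(..., key=len) picks the longest line in each group
    return [max(groups[key], key=len) for key in sorted(groups)]


# Hypothetical sample rows, shaped like what csv.reader would yield
sample = [
    ['112346', 'The data point was created on 1911.'],
    ['112346', 'The data point was created on 1911-12.'],
    ['112345', 'The data point was created on 1919.'],
]
for line in longest_per_id(sample):
    print(line)
```

When two lines of the same id have equal length, `max` returns the first one, whereas sorting and taking `r[-1]` returns the last; for exact duplicates the result is the same.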
Answer 3 (score: 1)
I am working under the (not unreasonable) assumption that your data is always sorted on the id.
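Initialization
import sys

prev_id = sys.maxsize  # larger than any real id, so the first row never triggers a print
longest = ""
data = open('myfile.dat')
Loop over the data
for row in data:
    row = row.rstrip('\n')  # drop the newline so print() does not double it
    curr_id = int(row.split()[0])
    if prev_id < curr_id:
        # a new id starts: output the longest row of the previous id
        print(longest)
        longest = row
    elif len(row) > len(longest):
        longest = row
    prev_id = curr_id
# here we have still one row to output
print(longest)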
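The relative merit of this answer is its memory efficiency, since the rows are processed one at a time. Of course, this efficiency depends on the ordering I assumed in the data file!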
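The same streaming idea can also be sketched with `itertools.groupby`, which groups neighbouring lines that share an id. This is only a sketch under the same sorted-input assumption, written for Python 3; the generator name `longest_rows` and the sample lines are hypothetical:

```python
from itertools import groupby


def longest_rows(lines):
    """Yield the longest line of each run of consecutive equal ids."""
    stripped = (line.rstrip('\n') for line in lines)
    non_blank = (line for line in stripped if line.strip())
    # groupby only merges neighbouring lines, so the input must be sorted by id
    for _, group in groupby(non_blank, key=lambda line: line.split()[0]):
        yield max(group, key=len)


# Hypothetical sample; an open file object can be passed the same way
sample = [
    '112345 The data point was created on 1919.\n',
    '112346 The data point was created on 1911.\n',
    '112346 The data point was created on 1911-12.\n',
    '112347 The data point was created on 1911.\n',
]
for line in longest_rows(sample):
    print(line)
```

Because everything here is a generator, the lines are still read one at a time; if the file is not sorted by id, the same id can be emitted more than once.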
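Answer 4 (score: 1)
This is how I removed the duplicates.
First, I removed the duplicates with Excel. But there were still some other duplicates with different column sizes (same id but a different length of row[1]). Of each pair of duplicate rows, I wanted the row whose second column is larger (higher len(row[1])). Here is what I did: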
import csv
import sys

dfo = open('fil.csv', newline='')
df = csv.reader(dfo, delimiter=',')
temp = ''
temp1 = ''
for r in reversed(list(df)):
    if r[0] == temp:
        continue  # same id as the row just printed: skip it
    elif len(r[1]) > len(temp1):
        print(r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3])
        # I used | for the csv separation.
    else:
        # same output as the branch above
        print(r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3])
    temp = r[0]
    temp1 = r[1]
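This took care of the duplicates. Here I basically skip the duplicate rows with the smaller r[1]. The rows are now printed in reversed order. I saved the output in a csv file and then printed this new file in reverse again (restoring the original order). That solved the problem.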
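For comparison, here is a sketch that keeps the row with the longest second column in one pass and in the original order, so no reversing or intermediate file is needed. It assumes Python 3.7+ (where dicts keep insertion order); the function name `keep_longest` and the four-column sample standing in for fil.csv are invented for illustration:

```python
import csv
import io


def keep_longest(rows):
    """Keep, for each id, the row whose second column is longest."""
    best = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        kept = best.get(row[0])
        if kept is None or len(row[1]) > len(kept[1]):
            best[row[0]] = row
    return list(best.values())


# Hypothetical four-column sample standing in for fil.csv
source = io.StringIO(
    "112346,created on 1911.,a,b\n"
    "112346,created on 1911-12.,c,d\n"
    "112347,created on 1911.,e,f\n"
)
out = io.StringIO()
csv.writer(out, delimiter='|', lineterminator='\n').writerows(
    keep_longest(csv.reader(source))
)
print(out.getvalue(), end='')
```

Unlike the reversed loop, this compares lengths across all rows of an id, so it does not matter where in the group the longest row sits.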
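Answer 5 (score: 0)
The reason your code skips rows is the next function. In my solution, I first read all the rows into a list, then sort the list by the first column and, within the same id, by the length of the second column in descending order. If the first-column value is the same as in the previous row, we keep only the first row and skip the others.
import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

# sort by id, then by descending length of the second column,
# so the longest row of each id comes first
your_list.sort(key=lambda row: (row[0], -len(row[1])))
result = [your_list[0]]  # to store the filtered results
for index in range(1, len(your_list)):
    # keep a row only when its id differs from the previous row's id
    if your_list[index][0] != your_list[index - 1][0]:
        result.append(your_list[index])
print(result)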
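To see why calling `next` inside the loop makes rows disappear, here is a small sketch in Python 3. A plain list iterator stands in for the csv.reader object, and the letters are placeholder rows:

```python
rows = iter(['a', 'b', 'c', 'd', 'e'])  # stands in for the csv.reader object

seen_by_loop = []
seen_by_next = []
for r in rows:
    seen_by_loop.append(r)
    # next() advances the same iterator, so the loop never sees this item;
    # the default None avoids StopIteration when the input has an odd length
    dup = next(rows, None)
    seen_by_next.append(dup)

print(seen_by_loop)  # ['a', 'c', 'e']
print(seen_by_next)  # ['b', 'd', None]
```

Each pass takes one row for `r` and a second one for `dup`, so the rows are consumed in pairs and a row is only ever compared with its pair partner, never with the row before it.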
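Answer 6 (score: 0)
How do I remove duplicate rows from a CSV?
Open the CSV in Excel. Excel has a built-in tool that lets you remove duplicates. Follow this tutorial for more information.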