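I want to find the duplicate values in one column of a multi-column CSV and replace them with the values from another column. So first I put two columns of the CSV into a dictionary. Then I want to find the duplicate values of that dictionary, which has string values and keys. I tried the solutions for removing duplicates from a dictionary, but I either get an "unhashable" error or no result at all. Here is the first part of the code.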
import csv
from collections import defaultdict
import itertools as it

mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k = rows[3].strip()
    v = rows[2].strip()
    if k in mydict:
        mydict[k].append(v)
    else:
        mydict[k] = [v]
#mydict = hash(frozenset(mydict))
print mydict

d = {}
while True:
    try:
        d = defaultdict(list)
        for k,v in mydict.iteritems():
            #d[frozenset(mydict.items())]
            d[v].append(k)
    except:
        continue

writer = csv.writer(open(r"OLD.csv", 'wb'))
for key, value in d.items():
    writer.writerow([key, value])
Answer 0 (score: 2)
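Your question is not clear, so I hope I understood it correctly.
Please give an example of the input columns and the desired output columns. Please also post the error output and tell us which line raises the error.
If column1 = [1,2,3,1,4] and column2 = [a,b,c,d,e], you want the output to be n_column1 = [a,2,3,d,4] and column2 = [1,b,c,d,e].
I guess the exception is raised at d[v].append(k), because v is evidently a list. You cannot use a list as a dictionary key, since lists are unhashable.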
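A minimal sketch of why that line fails, assuming mydict maps each key to a list of strings as in the question. The sample keys and values below are made up for illustration, and the snippet uses Python 3 syntax rather than the Python 2 of the question. Converting each list to a tuple is shown as one possible workaround, not necessarily what the asker needs.

```python
from collections import defaultdict

# Hypothetical sample data shaped like the question's mydict
mydict = {"k1": ["a", "b"], "k2": ["a", "b"], "k3": ["c"]}

inverted = defaultdict(list)
error_message = None
try:
    for k, v in mydict.items():
        inverted[v].append(k)  # v is a list, so the lookup raises TypeError
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# One possible workaround: tuples are hashable, so they can serve as keys
by_values = defaultdict(list)
for k, v in mydict.items():
    by_values[tuple(v)].append(k)

print(dict(by_values))
```

Grouping by tuple(v) collects all keys that share exactly the same list of values, which is one way to spot duplicated entries.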
For the example columns above, you can count how often each value occurs and then replace the duplicated ones:
In [1]: x = [1,2,3,1,4]
In [2]: y = ['a','b','c','d','e']
In [5]: from collections import defaultdict
In [6]: d = defaultdict(int)
In [7]: for a in x:
   ...:     d[a] += 1
In [8]: d
Out[8]: defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 1, 4: 1})
In [9]: x2 = []
In [10]: for a,b in zip(x,y):
   ....:     x2.append(a if d[a]==1 else b)
   ....:
In [11]: x
Out[11]: [1, 2, 3, 1, 4]
In [12]: x2
Out[12]: ['a', 2, 3, 'd', 4]
In that case, if I had to adapt your code, I would do something like this:
import csv
from collections import defaultdict

reader = csv.reader(open(r"computing.csv", "rb"))
histogram = defaultdict(int)  # how many times each key occurs
k = []  # values of column 3, in row order
v = []  # values of column 2, in row order
for i, rows in enumerate(reader):
    if i == 0:
        continue  # skip the header row
    if len(rows) == 0:
        continue  # skip empty rows
    k.append(rows[3].strip())
    v.append(rows[2].strip())
    item = k[-1]
    histogram[item] += 1

# Keep a key that occurs once; replace a duplicated key with the value from the same row
output_column = []
for first_item, second_item in zip(k,v):
    output_column.append(first_item if histogram[first_item]==1 else second_item)

writer = csv.writer(open(r"OLD.csv", 'wb'))
for c1, c2 in zip(output_column, v):
    writer.writerow([c1, c2])
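For readers on Python 3, the same counting idea can be sketched with collections.Counter. The function name replace_duplicates is hypothetical and only used here for illustration. It works on two plain lists, so reading and writing the CSV file is left out.

```python
from collections import Counter


def replace_duplicates(keys, values):
    """Return keys, with every duplicated entry replaced by the value in the same row."""
    counts = Counter(keys)  # number of occurrences of each key
    return [k if counts[k] == 1 else v for k, v in zip(keys, values)]


x = [1, 2, 3, 1, 4]
y = ["a", "b", "c", "d", "e"]
print(replace_duplicates(x, y))  # ['a', 2, 3, 'd', 4]
```

When moving the file handling to Python 3 as well, note that the csv module expects files opened in text mode with newline="" instead of the "rb" and "wb" modes used above.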