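I have a csv file from which I want to keep only the unique records. The 4th field contains some text followed by either HUMAN or MOUSE, like RHPN1_HUMAN and EPHA5_MOUSE.
For example: EPHA5 occurs in both human and mouse, so I want to remove that record, whereas RHPN1 occurs only in human, so I want to keep it.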
file1.csv
meNOG00001 9606 ENSP00000289013 RHPN1_HUMAN
meNOG00005 10090 ENSMUSP00000060646 EPHA5_MOUSE
meNOG00005 9606 ENSP00000273854 EPHA5_HUMAN
meNOG00006 10090 ENSMUSP00000082503 RGPA1_MOUSE
meNOG00006 9606 ENSP00000202677 RGPA2_HUMAN
meNOG00006 9606 ENSP00000302647 RGPA1_HUMAN
meNOG00010 9606 ENSP00000253669 HAUS8_HUMAN
meNOG00011 10090 ENSMUSP00000017629 TOP2B_MOUSE
meNOG00011 10090 ENSMUSP00000068896 TOP2A_MOUSE
meNOG00011 9606 ENSP00000396704 TOP2B_HUMAN
meNOG00011 9606 ENSP00000411532 TOP2A_HUMAN
output.csv
meNOG00001 9606 ENSP00000289013 RHPN1_HUMAN
meNOG00006 9606 ENSP00000202677 RGPA2_HUMAN
meNOG00010 9606 ENSP00000253669 HAUS8_HUMAN
I tried this, but my code does not work the way I want:
file1 = open("file1.csv", "rU")
reader1 = csv.reader(file1,delimiter=',')
d =[]
c =[]
for row in reader1:
    d.append(row[3].split('_')[0])
d=list(set(d))
for row1 in d:
    for row2 in reader1:
        if row1 == row2[3].split('_')[0]:
            c.append(row2)
    file1.seek(0)
with open('output.csv', 'w') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    for k in c:
        writer.writerow(k)
Answer 0: (score: 1)
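import csv
import collections

# OrderedDict keeps the surviving rows in the order they were first read
data = collections.OrderedDict()
# Python 3: the csv module recommends newline='' (the old "rU" mode was removed in 3.11)
with open("file1.csv", newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        key = row[3].split('_')[0]
        if key in data:
            # second occurrence of this key: drop the stored record
            del data[key]
        else:
            data[key] = row
with open('output.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    writer.writerows(data.values())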
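If a key can appear more than twice, you will need a different way to keep track of the keys that have already been seen. You could use a set. For example,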
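import csv
import collections

seen = set()  # every key encountered so far, including deleted ones
data = collections.OrderedDict()
# Python 3: the csv module recommends newline='' (the old "rU" mode was removed in 3.11)
with open("file1.csv", newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        key = row[3].split('_')[0]
        if key in seen:
            # pop() with a default avoids a KeyError on the third and later occurrences
            data.pop(key, None)
        else:
            data[key] = row
            seen.add(key)
with open('output.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    writer.writerows(data.values())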
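If holding all rows in memory is acceptable, the same filtering could also be sketched with collections.Counter: count each prefix first, then keep only the rows whose prefix was counted exactly once. This is only an illustration. keep_unique and the in-memory rows list are hypothetical names that are not part of the code above, and the rows are assumed to be already split into fields.

```python
import collections


def prefix(row, field=3):
    """Text before the first '_' in the given field, e.g. 'EPHA5' for 'EPHA5_MOUSE'."""
    return row[field].split('_')[0]


def keep_unique(rows, field=3):
    """Hypothetical helper: return the rows whose prefix occurs exactly once."""
    counts = collections.Counter(prefix(row, field) for row in rows)
    return [row for row in rows if counts[prefix(row, field)] == 1]


# A few records from file1.csv, assumed to be already parsed into lists
rows = [
    ["meNOG00001", "9606", "ENSP00000289013", "RHPN1_HUMAN"],
    ["meNOG00005", "10090", "ENSMUSP00000060646", "EPHA5_MOUSE"],
    ["meNOG00005", "9606", "ENSP00000273854", "EPHA5_HUMAN"],
    ["meNOG00010", "9606", "ENSP00000253669", "HAUS8_HUMAN"],
]
unique = keep_unique(rows)
print([row[3] for row in unique])  # ['RHPN1_HUMAN', 'HAUS8_HUMAN']
```

Because the counts are computed before any row is kept, it does not matter how many times a prefix repeats.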
Answer 1: (score: 0)
Not fully tested, but you could use something like this:
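from collections import OrderedDict

class OD(OrderedDict):
    """OrderedDict that drops a key for good once it is set a second time."""
    def __init__(self, *args, **kwargs):
        # per-instance set of every key seen so far (a class-level set would be shared)
        self.coll = set()
        super().__init__(*args, **kwargs)
    def __setitem__(self, key, value):
        if key in self.coll:
            # repeated key: remove it if it is still stored, otherwise ignore
            try:
                del self[key]
            except KeyError:
                pass
        else:
            OrderedDict.__setitem__(self, key, value)
            self.coll.add(key)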
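The reason is that I am not sure whether you will ever have more than 2 matches. For example, if a code matches an odd number of times, simply checking against the keys currently in the dictionary will not work, because any key that appears an odd number of times would end up being treated as unique. The class above handles that case, though. (It may be a bit overkill.)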
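import csv

d = OD()
# Python 3: the csv module recommends newline='' (the old "rU" mode was removed in 3.11)
with open("file1.csv", newline='') as f_in:
    reader = csv.reader(f_in, delimiter=',')
    for row in reader:
        key = row[3].split('_')[0]
        d[key] = row  # OD silently drops keys that are set more than once
with open('output.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out, delimiter=',')
    for val in d.values():
        writer.writerow(val)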
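To make the odd-count concern concrete, here is a small sketch that compares the two strategies on bare keys. toggle_filter and seen_filter are hypothetical helper names, and the key list is made up for illustration (EPHA5 is repeated three times, which does not happen in the sample file).

```python
import collections


def toggle_filter(keys):
    """Delete a key when it is seen again, checking only the keys currently stored."""
    data = collections.OrderedDict()
    for key in keys:
        if key in data:
            del data[key]
        else:
            data[key] = True  # a third occurrence lands here and re-adds the key
    return list(data)


def seen_filter(keys):
    """Remember every key ever seen, so a repeated key can never come back."""
    seen = set()
    data = collections.OrderedDict()
    for key in keys:
        if key in seen:
            data.pop(key, None)
        else:
            data[key] = True
            seen.add(key)
    return list(data)


keys = ["RHPN1", "EPHA5", "EPHA5", "EPHA5"]  # made-up input: EPHA5 appears three times
print(toggle_filter(keys))  # ['RHPN1', 'EPHA5']
print(seen_filter(keys))    # ['RHPN1']
```

With exactly two occurrences per repeated key both functions agree, so the extra bookkeeping only matters if a key can repeat three or more times.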