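I am trying to sort column A of a CSV file so that duplicate values are grouped together, but I am not getting the expected result in Python.
Input file: (.csv)
Column names: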
Uniprot Acc, PDB ID, Ligand ID, Structure Title, Uniprot Recommended Name, Gene Name, Macromolecular Name
I want to sort the Uniprot Acc column so that duplicated and single values are listed together with their PDB ID and Ligand ID.
Input file:
Uniprot Acc PDB ID Ligand ID
* P0AET8 1AHI NAI
* P04036 1ARZ NAI
* Q59771 1C1D NAI
* P0C0F4 1DLJ NAI
* Q9QYY9 1E3E NAI
* Q9QYY9 1E3I NAI
* Q14376 1EK6 NAI
* Q16836 1F17 NAI
* P0AET8 1FMC NAI
* Q46220 1GIQ NAI
* P97852 1GZ6 NAI
* P07195 1I0Z NAI
* P00338 1I10 NAI
* P11986 1JKI NAI
* P10760 1KY5 NAI
* Q2RSB2 1L7E NAI
* Q27743 1LDG NAI
* O32080 1LSU NAI
* P00334 1MG5 NAI
* P26392 1N2S NAI
* P9WGT1 1NFQ NAI
* P0ABH7 1NXG NAI
* P05091 1NZW NAI
* P05091 1NZZ NAI
* P27443 1O0S NAI
* P0A6D5 1O9B NAI
* P20974 1OG4 NAI
* P11986 1P1J NAI
Expected Result:
Uniprot Acc PDB ID Ligand ID
* P0AET8 1AHI NAI
* P0AET8 1FMC NAI
* P04036 1ARZ NAI
* Q59771 1C1D NAI
* P0C0F4 1DLJ NAI
* Q9QYY9 1E3E NAI
* Q9QYY9 1E3I NAI
.
.
.
I want to group the rows that share the same Uniprot Acc ID, keeping their PDB IDs, and keep the IDs that occur only once as well. No ID needs to be removed.
Code:
import csv
import re
import sys
import os
f1 = csv.reader(open('one.csv', 'rb'))
writer = csv.writer(open("Output_file_1.csv", "wb"))
def has_duplicates(f1):
    for i in range(0, len(f1)):
        for x in range(i + 1, len(f1)):
            if f1[i] == f1[x]:
                var = f1[i]
                writer.writerow(var)
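Answer 0 (score: 1)
You can first store all the values in a list; after that it is easy to find the duplicate values in sorted order. See the code below.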
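import csv

# Python 3: open CSV files in text mode with newline='' (on Python 2 use 'rb' and 'wb')
f1 = csv.reader(open('one.csv', newline=''))
writer = csv.writer(open("Output_file_1.csv", "w", newline=''))

def has_duplicates(f1):
    # csv.reader is an iterator, so copy its rows into a list (skipping blank lines)
    rows = [row for row in f1 if row]
    # compare on column 0 (Uniprot Acc); whole rows in the sample never repeat
    accs = [row[0] for row in rows]
    for var in sorted(x for x in rows if accs.count(x[0]) > 1):
        writer.writerow(var)  # write only the rows with a duplicated ID, sorted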
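As a rough alternative for finding which Uniprot Acc values repeat, the first column can be counted once with collections.Counter rather than calling list.count for every row. This is only a sketch: it assumes the accession sits in column 0, it uses a handful of rows from the question instead of the real file, and the name duplicated_rows is made up for illustration.

```python
from collections import Counter

def duplicated_rows(rows):
    """Return the rows whose first column (assumed to be Uniprot Acc) occurs more than once."""
    counts = Counter(row[0] for row in rows)
    return [row for row in rows if counts[row[0]] > 1]

# A few rows taken from the question's input (not the whole file)
sample = [
    ["P0AET8", "1AHI", "NAI"],
    ["P04036", "1ARZ", "NAI"],
    ["Q9QYY9", "1E3E", "NAI"],
    ["Q9QYY9", "1E3I", "NAI"],
    ["P0AET8", "1FMC", "NAI"],
]

for row in duplicated_rows(sample):
    print(row)
```

The rows come back in their input order, so a separate sort or grouping step is still needed if duplicates should sit next to each other.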
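You could also use sorted, but its result differs slightly from what you expect: it orders the values alphabetically instead of keeping the order in which they first appear. To get the expected result, you can use the following code.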
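def sort_duplicates(rows):
    # rows must be a list of rows, e.g. list(f1); rows are matched on
    # column 0 (Uniprot Acc) because whole rows never repeat in the sample data
    for i in range(len(rows)):
        key = rows[i][0]
        # index of the first row that has the same Uniprot Acc
        first = next(j for j, row in enumerate(rows) if row[0] == key)
        rows.insert(first + 1, rows[i])  # copy the row next to its first match
        rows.pop(i + 1)                  # drop the original, now shifted by one
    for var in rows:
        writer.writerow(var)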
I tested this on a plain list. This is the result with sorted:
>>> a=['P0AET8', 'Q59771', 'P0C0F4','DFC4H', 'P0AET8','Q59771','ACG5D']
>>> print(sorted(a))
['ACG5D', 'DFC4H', 'P0AET8', 'P0AET8', 'P0C0F4', 'Q59771', 'Q59771']
With the insert/pop approach applied to the same list, this is the result:
>>> a=['P0AET8', 'Q59771', 'P0C0F4','DFC4H', 'P0AET8','Q59771','ACG5D']
>>> for i in range(0,len(a)):
...     a.insert(a.index(a[i])+1, a[i])
...     a.pop(i+1)
>>> print(a)
['P0AET8', 'P0AET8', 'Q59771', 'Q59771', 'P0C0F4', 'DFC4H', 'ACG5D']
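The same grouping can probably be done without mutating the list while looping over it. The sketch below is a swapped-in technique, not the code from this answer: it records the position at which each Uniprot Acc first appears and passes that position to sorted as the key. Because sorted is stable, rows that share an ID keep their input order. The comma-separated layout, the in-memory io.StringIO stand-in for one.csv, and the name group_by_first_column are all assumptions made for illustration.

```python
import csv
import io

def group_by_first_column(rows):
    """Reorder rows so equal first-column values are adjacent,
    keeping the order in which each value first appears."""
    first_seen = {}
    for row in rows:
        # remember the rank of each ID the first time it shows up
        first_seen.setdefault(row[0], len(first_seen))
    # sorted() is stable, so rows inside a group keep their input order
    return sorted(rows, key=lambda row: first_seen[row[0]])

# In-memory stand-in for one.csv, using a few rows from the question
source = io.StringIO(
    "Uniprot Acc,PDB ID,Ligand ID\n"
    "P0AET8,1AHI,NAI\n"
    "P04036,1ARZ,NAI\n"
    "Q59771,1C1D,NAI\n"
    "Q9QYY9,1E3E,NAI\n"
    "Q9QYY9,1E3I,NAI\n"
    "P0AET8,1FMC,NAI\n"
)

reader = csv.reader(source)
header = next(reader)            # keep the header row out of the grouping
grouped = group_by_first_column(list(reader))

out = io.StringIO()              # stand-in for Output_file_1.csv
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(grouped)
print(out.getvalue())
```

To run it against real files, replace the two io.StringIO objects with open('one.csv', newline='') and open('Output_file_1.csv', 'w', newline='').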