From a CSV file

Time: 2016-09-02 05:14:06

Tags: python

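I am trying to sort a CSV file so that duplicate values in column A are grouped together, but I am not getting the expected result in Python.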

Input file (.csv):

Column names:

Uniprot Acc, PDB ID, Ligand ID, Structure Title, Uniprot Recommended Name, Gene Name, Macromolecular Name

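I want to sort on the Uniprot Acc column so that repeated accessions are grouped together and accessions that occur only once are kept, with the PDB ID and Ligand ID carried along in each row.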

 Input file:
 Uniprot Acc  PDB ID  Ligand ID
 * P0AET8   1AHI    NAI
 * P04036   1ARZ    NAI
 * Q59771   1C1D    NAI
 * P0C0F4   1DLJ    NAI
 * Q9QYY9   1E3E    NAI
 * Q9QYY9   1E3I    NAI
 * Q14376   1EK6    NAI
 * Q16836   1F17    NAI
 * P0AET8   1FMC    NAI
 * Q46220   1GIQ    NAI
 * P97852   1GZ6    NAI
 * P07195   1I0Z    NAI
 * P00338   1I10    NAI
 * P11986   1JKI    NAI
 * P10760   1KY5    NAI
 * Q2RSB2   1L7E    NAI
 * Q27743   1LDG    NAI
 * O32080   1LSU    NAI
 * P00334   1MG5    NAI
 * P26392   1N2S    NAI
 * P9WGT1   1NFQ    NAI
 * P0ABH7   1NXG    NAI
 * P05091   1NZW    NAI
 * P05091   1NZZ    NAI
 * P27443   1O0S    NAI
 * P0A6D5   1O9B    NAI
 * P20974   1OG4    NAI
 * P11986   1P1J    NAI

 Expected Result:
 Uniprot Acc  PDB ID  Ligand ID
 * P0AET8   1AHI    NAI
 * P0AET8   1FMC    NAI
 * P04036   1ARZ    NAI
 * Q59771   1C1D    NAI
 * P0C0F4   1DLJ    NAI
 * Q9QYY9   1E3E    NAI
 * Q9QYY9   1E3I    NAI
   .
   .
   .





 I want rows that share a Uniprot Acc to appear next to each other, each with its PDB ID, alongside the accessions that occur only once. No rows need to be removed.

Code:

import csv
import re
import sys
import os

f1 = csv.reader(open('one.csv', 'rb'))

writer = csv.writer(open("Output_file_1.csv", "wb"))
def has_duplicates(f1):    
    for i in range(0, len(f1)):
        for x in range(i + 1, len(f1)):
            if f1[i] == f1[x]:
                var = f1[i]                    
                writer.writerow(var)

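1 answer:

Answer 0: (score: 1)

You can first store all the values in a list; after that it is easy to find the duplicate values in sorted order. See the code below.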

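  import csv

  # Python 2 file modes, as in the question. On Python 3 use
  # open('one.csv', newline='') and open('Output_file_1.csv', 'w', newline='').
  f1 = csv.reader(open('one.csv', 'rb'))

  writer = csv.writer(open("Output_file_1.csv", "wb"))

  def has_duplicates(f1):
      # A csv.reader has no len() or indexing, so read it into a list first.
      # Rows become tuples because lists cannot be put in a set.
      # Whole rows are compared; count row[0] instead to compare only Uniprot Acc.
      rows = [tuple(row) for row in f1]
      for var in sorted(set(x for x in rows if rows.count(x) > 1)):
          writer.writerow(var)  # write each duplicated row once, in sorted order

  has_duplicates(f1)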
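The same idea can be sketched in Python 3 for the case where duplicates are judged on the first column (Uniprot Acc) rather than on whole rows. This is only an illustration under that assumption: the helper name `duplicated_rows` and the inline `SAMPLE` data are hypothetical, and `collections.Counter` stands in for the repeated `list.count` calls.

```python
import csv
import io
from collections import Counter

# Hypothetical sample mirroring a few rows of the question's file.
SAMPLE = """Uniprot Acc,PDB ID,Ligand ID
P0AET8,1AHI,NAI
P04036,1ARZ,NAI
Q9QYY9,1E3E,NAI
Q9QYY9,1E3I,NAI
P0AET8,1FMC,NAI
"""

def duplicated_rows(rows, column=0):
    """Return rows whose value in `column` occurs more than once, sorted by that value."""
    counts = Counter(row[column] for row in rows)
    dupes = [row for row in rows if counts[row[column]] > 1]
    # sorted() is stable, so rows with the same accession keep their file order.
    return sorted(dupes, key=lambda row: row[column])

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)                    # keep the header out of the counting
rows = [row for row in reader if row]    # skip blank lines
result = duplicated_rows(rows)
for row in result:
    print(row)
```

Counting once with `Counter` avoids rescanning the list for every row, which matters for larger files.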

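New edit for the expected result

You could use sorted, which also groups the duplicates, but it puts the values in alphabetical order, so the result differs slightly from what you expect. To get exactly the expected result, you can use the following code.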

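  def sort_duplicates(f1):
      rows = [row for row in f1 if row]  # a csv.reader cannot be indexed, so read it into a list
      for i in range(0, len(rows)):
          keys = [row[0] for row in rows]  # Uniprot Acc column
          # move row i to just after the first row with the same accession
          rows.insert(keys.index(rows[i][0]) + 1, rows[i])
          rows.pop(i + 1)
      for var in rows:
          writer.writerow(var)

  sort_duplicates(f1)  # run this instead of has_duplicates; the reader can only be read once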
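An alternative way to reach the expected ordering, sketched here in Python 3, is to collect rows into groups keyed by accession and then write the groups out in the order each accession first appeared. The function name `group_by_first_seen` and the inline `SAMPLE` data are hypothetical, and the sketch assumes Python 3.7 or later, where a plain `dict` preserves insertion order.

```python
import csv
import io

# Hypothetical sample mirroring a few rows of the question's file.
SAMPLE = """Uniprot Acc,PDB ID,Ligand ID
P0AET8,1AHI,NAI
P04036,1ARZ,NAI
Q59771,1C1D,NAI
Q9QYY9,1E3E,NAI
Q9QYY9,1E3I,NAI
P0AET8,1FMC,NAI
"""

def group_by_first_seen(rows, column=0):
    """Group rows by the value in `column`, keeping first-appearance order."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    # Flatten the groups back into a single list of rows.
    return [row for group in groups.values() for row in group]

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
grouped = group_by_first_seen([row for row in reader if row])

out = io.StringIO()                       # stands in for the output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(grouped)
print(out.getvalue())
```

Unlike the insert/pop loop, this makes a single pass over the rows and keeps rows with the same accession in their original relative order, even when an accession appears three or more times.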

I tested this on a list. Here are the results.

>>> a=['P0AET8', 'Q59771', 'P0C0F4','DFC4H', 'P0AET8','Q59771','ACG5D']
>>> print sorted(a)
['ACG5D', 'DFC4H', 'P0AET8', 'P0AET8', 'P0C0F4', 'Q59771', 'Q59771']

If you use the code above, this is the result:

>>> a=['P0AET8', 'Q59771', 'P0C0F4','DFC4H', 'P0AET8','Q59771','ACG5D']
>>> for i in range(0,len(a)):
...             a.insert(a.index(a[i])+1, a[i])
...             a.pop(i+1)

>>> print a
['P0AET8', 'P0AET8', 'Q59771', 'Q59771', 'P0C0F4', 'DFC4H', 'ACG5D']
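The same grouping can also be obtained without modifying the list in place. The sketch below (Python 3) relies on `sorted` being stable and sorts each value by the position where it first appeared; the names `first_seen` and `grouped` are hypothetical.

```python
a = ['P0AET8', 'Q59771', 'P0C0F4', 'DFC4H', 'P0AET8', 'Q59771', 'ACG5D']

# Record the index at which each value first occurs.
first_seen = {}
for position, value in enumerate(a):
    first_seen.setdefault(value, position)

# A stable sort on that index pulls later duplicates up next to the first occurrence.
grouped = sorted(a, key=first_seen.__getitem__)
print(grouped)
# ['P0AET8', 'P0AET8', 'Q59771', 'Q59771', 'P0C0F4', 'DFC4H', 'ACG5D']
```

This avoids the repeated `index` scans and the insert/pop pair inside the loop, and leaves the original list `a` unchanged.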