Question

I am trying to sort a sequence file by a certain parameter. The data looks like this:

ID1 ID2 32

MVKVYAPASSANMSVGFDVLGAAVTP ......

ID1 ID2 18

MKLYNLKDHNEQVSFAQAVTQGLGKN ......

...

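There are about 3000 such sequences: the first line of each record holds two ID fields and a rank field (the sort key), and the second line holds the sequence. My approach is to open the file, convert the file object to a list, separate the annotation lines (ID1, ID2, rank) from the actual sequences (annotation lines always fall on even indices, sequence lines on odd indices), merge them into a dictionary, and sort the dictionary by the rank field. The code looks like this: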

#!/usr/bin/python

with open("unsorted.out","rb") as f:
    f = f.readlines()

assert type(f) == list, "ERROR: file object not converted to list"

annot=[]
seq=[]

for i in range(len(f)):
    # IDs
    if i%2 == 0:
        annot.append(f[i])
    # Sequences     
    elif i%2 != 0:
        seq.append(f[i])

# Make dictionary
ids_seqs = {}         
ids_seqs = dict(zip(annot,seq))

# Solub rankings are the third field of the annot list, i.e. annot[i].split()[2]
# Use this index notation to rank sequences according to solubility measurements 

sorted_niwa = sorted(ids_seqs.items(), key = lambda val: val[0].split()[2], reverse=False)

# Save to file
with open("sorted.out","wb") as out:
    out.write("".join("%s %s" % i for i in sorted_niwa))

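The problem is that when I open the sorted file to check it by hand and scroll down, I notice that some sequences are sorted incorrectly. For example, I see rank 9 placed after rank 89. The sorting is correct up to a certain point, but I don't understand why it isn't correct all the way through.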

Thanks very much for your help!

Answer 1

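It sounds like you are comparing strings rather than numbers. "9" > "89" because the character '9' comes after the character '8' in lexicographic order. Try converting the rank to an integer in the key.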

sorted_niwa = sorted(ids_seqs.items(), key = lambda val: int(val[0].split()[2]), reverse=False)
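To see the difference on a small scale, here is a minimal sketch that sorts the same records once with a string key and once with an integer key. The sample lines, the IDs, and the helper names `rank_as_text` and `rank_as_int` are made up for illustration and are not from the question's file; the third sequence is a placeholder.

```python
# Hypothetical sample in the question's two-line format:
# an annotation line ("ID ID rank") followed by a sequence line.
lines = [
    "ID1 ID2 89\n", "MVKVYAPASSANMSVGFDVLGAAVTP\n",
    "ID3 ID4 9\n", "MKLYNLKDHNEQVSFAQAVTQGLGKN\n",
    "ID5 ID6 100\n", "AAAAAAAAAAAAAAAAAAAAAAAAAA\n",
]

# Pair each annotation line (even index) with its sequence line (odd index).
records = list(zip(lines[0::2], lines[1::2]))


def rank_as_text(record):
    """Return the rank field as a string (the behavior in the question)."""
    return record[0].split()[2]


def rank_as_int(record):
    """Return the rank field converted to an integer."""
    return int(record[0].split()[2])


# String keys compare character by character, so "100" < "89" < "9".
by_text = [rank_as_text(r) for r in sorted(records, key=rank_as_text)]

# Integer keys compare numerically.
by_int = [rank_as_int(r) for r in sorted(records, key=rank_as_int)]

print(by_text)  # ['100', '89', '9']
print(by_int)   # [9, 89, 100]
```

This also explains why the output looks right up to a point: ranks with the same number of digits order identically as strings and as integers, so the error only shows up where ranks of different lengths meet.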
