Question

I have two strings in a file like this:

>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat

Using the AlignIO module in Biopython:

from Bio import AlignIO
print(AlignIO.read("neighbor.fas", "fasta"))

returns:

SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2

I want to calculate the percentage identity between the two rows in this alignment.

row = align[:,n]

allows the extraction of individual columns, which can then be compared. Columns that contain only "-" should not be counted.

The following code works, but it is slow:

from Bio import AlignIO

align = AlignIO.read("neighbor.fas", "fasta")
for n in range(0,len(align[0])):
    n=0
    i=0
    y=0
    while n<len(align[0]):
        column = align[:,n]
        for c in column:
            if column[0]==column[1]:
                if column[0]!="-":
                    i=i+1
                else:
                    y=y+1 # this counts gap only columns, remove them later

        n=n+1
match= float(i/2)
length= float(len(align[0])-y/2)
identity =  100*(float(match/length))

print(identity)

I would appreciate any help optimizing this.

Thanks!

Edit: a possible answer:

It takes a different approach, but it is much faster!

from Bio import AlignIO


align = AlignIO.read("neighbor.fas", "fasta")
A=list(align[0])
B=list(align[1])
count=0
gaps=0
for n in range(0, len(A)):
    if A[n]==B[n]:
        if A[n]!="-":
            count=count+1
        else:
            gaps=gaps+1
print(100*(count/float((len(A)-gaps))))
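The same idea can be written as a small function over plain strings. This is only a sketch: the name `pairwise_identity` and its `gap` parameter are made up for illustration, and it assumes both sequences are already aligned to the same length.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Percent identity of two aligned strings, ignoring gap-only columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # gap-only column: left out of the length entirely
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0  # nothing but gap-only columns
    return 100.0 * matches / compared


print(pairwise_identity("atggca---------gtgtggcaatcggcacat",
                        "atggca---------gtgtggcaatcggcacat"))
```

With Biopython, the two input strings could be taken from the alignment as `str(align[0].seq)` and `str(align[1].seq)`.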

Answer 1

Here is an answer that is fast, but not biologically accurate.

Use the Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

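The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.

Since these sequences are strings, why not!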

sudo pip install python-Levenshtein

Then start ipython:

In [1]: import Levenshtein

In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0

In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334
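If installing the C extension is not an option, the standard library's `difflib.SequenceMatcher` gives a similar kind of ratio. This is a sketch under assumptions: `similarity_percent` is a hypothetical helper name, and difflib uses a different matching algorithm, so its values are not guaranteed to agree with `Levenshtein.ratio` on other inputs.

```python
from difflib import SequenceMatcher


def similarity_percent(seq_a, seq_b, gap="-"):
    """String similarity (0-100) of two sequences after removing gap characters."""
    a = seq_a.replace(gap, "")
    b = seq_b.replace(gap, "")
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which would otherwise apply to sequences of 200 characters or more.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


print(similarity_percent("atggca---------gtgtggcaatcggcacat",
                         "atggca---------gtgtggcaatcggcacaa"))
```

As with the Levenshtein approach, gaps are stripped before comparing, so the result measures string similarity rather than column-by-column identity in the alignment.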

Answer 2

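I know this question is old, but since you are already using Biopython, couldn't you just go through the BLAST record classes (chapter 7 of the tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html)?

I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".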
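As a rough illustration of how that value turns into a percentage: the HSP objects described in that tutorial chapter carry `identities` and `align_length` attributes. The sketch below uses a hypothetical stand-in object rather than a real parsed BLAST record, so `StandInHSP` and `hsp_percent_identity` are made-up names.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed HSP; a real one would come from
# parsing BLAST output with Biopython.
StandInHSP = namedtuple("StandInHSP", ["identities", "align_length"])


def hsp_percent_identity(hsp):
    """Percent identity from an object exposing identities and align_length."""
    if hsp.align_length == 0:
        return 0.0
    return 100.0 * hsp.identities / hsp.align_length


example = StandInHSP(identities=23, align_length=24)
print(hsp_percent_identity(example))
```

Note that this divides by the length of the BLAST alignment, which includes gapped columns, so it answers a slightly different question than comparing two rows of an existing alignment file.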

Answer 3

If you want to extend this to more than two sequences, the following works well; it gives the same result as the BioPerl overall_percentage_identity function (http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm).

from Bio import AlignIO

def perc_identity(aln):
    # Count the columns in which every row has the same character
    i = 0
    for a in range(0,len(aln[0])):
        s = aln[:,a]
        if s == len(s) * s[0]:
            i += 1
    return 100*i/float(len(aln[0]))

align = AlignIO.read("neighbor.fas", "fasta")
print(perc_identity(align))
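The function above counts gap-only columns as identical. A variant that can leave them out, as the question asked, might look like the sketch below. It works on plain strings, and the name `percent_identity_columns` and the `skip_gap_only` flag are assumptions made for illustration, not part of Biopython or BioPerl.

```python
def percent_identity_columns(rows, gap="-", skip_gap_only=True):
    """Percent of alignment columns in which all rows carry the same character."""
    if not rows:
        raise ValueError("need at least one sequence")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    counted = 0
    for column in zip(*rows):  # one tuple of characters per alignment column
        if skip_gap_only and all(c == gap for c in column):
            continue  # drop gap-only columns from numerator and denominator
        counted += 1
        if len(set(column)) == 1:
            identical += 1
    return 100.0 * identical / counted if counted else 0.0


rows = ["atg-ca", "atg-ca", "atg-ct"]
print(percent_identity_columns(rows))
print(percent_identity_columns(rows, skip_gap_only=False))
```

With `skip_gap_only=False` a gap-only column is counted as an identical column, which mirrors the behaviour of the shorter function above.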

In Python: calculate percentage of identity between two strings
