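Mutual information of proteins

Time: 2017-03-13 20:38:40

Tags: python bioinformatics biopython

I am trying to find the mutual information (MI) between the columns of a multiple sequence alignment (MSA).

The math behind it is fine with me. However, I don't know how to implement it in Python, at least not in a fast way.

How should I compute the overall frequencies P(i;x), P(j;y) and P(ij;xy)? The Px and Py frequencies are easy to compute, and a hash can handle them, but what about P(ij;xy)?

So my real question is: how do I compute the probability Pxy for a given pair of columns i and j?

Note that MI can be defined as:

MI(i,j) = Sum(x->n) Sum(y->m) P(ij,xy) * log(P(ij,xy) / (P(i,x) * P(j,y)))

where i and j are amino acid positions (columns), and x and y are the different amino acids found in column i and column j, respectively.

Thanks,

Edit

My input data looks like a df: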

A = [
['M','T','S','K','L','G','-','-','S','L','K','P'],
['M','A','A','S','L','A','-','A','S','L','P','E'],
...,
['M','T','S','K','L','G','A','A','S','L','P','E'],
]

So it is indeed very easy to compute the frequency of any amino acid at a given position, for example:

P(M) at position 1: 1
P(T) at position 2: 2/3
P(A) at position 2: 1/3
P(S) at position 3: 2/3
P(A) at position 3: 1/3

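How should I proceed to get, for example, the frequency of a T at position 2 together with an S at position 3? In this example it is 2/3.

So P(ij,xy) is the probability (or frequency) that amino acid x occurs in column i at the same time as amino acid y occurs in column j.

PS: For a simpler explanation of MI, see this link: mistic.leloir.org.ar/docs/help.html (thanks, Aaron)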

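2 Answers:

Answer 0: (score: 1)

I'm not 100% sure this is correct (for example, how should '-' be handled?). I assume that the sum runs over all pairs for which the frequencies in the numerator and denominator inside the log are all nonzero, and I also assume that it should be the natural log: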

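from math import log
from collections import Counter

def MI(sequences, i, j):
    n = float(len(sequences))  # number of sequences (rows) in the alignment
    # Raw counts of each residue in column i, in column j, and of each (x, y) pair
    Pi = Counter(sequence[i] for sequence in sequences)
    Pj = Counter(sequence[j] for sequence in sequences)
    Pij = Counter((sequence[i], sequence[j]) for sequence in sequences)

    # c/n is P(ij,xy); c*n/(Pi[x]*Pj[y]) is P(ij,xy)/(P(i,x)*P(j,y))
    return sum(c / n * log(c * n / (Pi[x] * Pj[y])) for (x, y), c in Pij.items())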

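The code works by using three Counter objects to get the relevant counts, and then returns a sum that is a direct translation of the formula.

If this isn't correct, it would be helpful if you edited your question to include some expected output to test against.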
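As one hand-checkable test case, here is a worked value under stated assumptions: the three sequences shown in full in the question are taken to be the whole alignment, positions are 1-based as in the question, the log is natural, and each frequency is a count divided by the number of sequences. Positions 2 and 3 then contain only the pairs (T,S) twice and (A,A) once:

```latex
\begin{aligned}
P(2,\mathrm{T}) &= P(3,\mathrm{S}) = P(23,\mathrm{TS}) = \tfrac{2}{3}, \qquad
P(2,\mathrm{A}) = P(3,\mathrm{A}) = P(23,\mathrm{AA}) = \tfrac{1}{3} \\
MI(2,3) &= \tfrac{2}{3}\ln\frac{2/3}{(2/3)(2/3)} + \tfrac{1}{3}\ln\frac{1/3}{(1/3)(1/3)} \\
        &= \tfrac{2}{3}\ln\tfrac{3}{2} + \tfrac{1}{3}\ln 3 \approx 0.6365
\end{aligned}
```

Because the two columns vary in lockstep in this small example, the result equals the entropy of either column on its own.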

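On edit: here is a version which doesn't treat '-' as just another amino acid, but instead filters out the sequences in which it appears in either of the two columns, interpreting those sequences as ones for which the necessary information isn't available: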

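def MI(sequences, i, j):
    # Keep only the sequences with no gap in column i or column j
    sequences = [s for s in sequences if '-' not in (s[i], s[j])]
    n = float(len(sequences))  # number of sequences left after filtering
    Pi = Counter(s[i] for s in sequences)
    Pj = Counter(s[j] for s in sequences)
    Pij = Counter((s[i], s[j]) for s in sequences)

    # c/n is P(ij,xy); c*n/(Pi[x]*Pj[y]) is P(ij,xy)/(P(i,x)*P(j,y))
    return sum(c / n * log(c * n / (Pi[x] * Pj[y])) for (x, y), c in Pij.items())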
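If the goal is an MI value for every pair of columns, the gap-filtering idea above could be applied pair by pair. The following is only a sketch under assumptions: Python 3, frequencies taken as counts divided by the number of gap-free rows, natural log, and an MI of 0.0 when no usable rows remain. The names `mutual_information` and `mi_matrix` are illustrative and not part of any library.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(sequences, i, j, gap="-"):
    """MI (natural log) between columns i and j, skipping rows with a gap in either."""
    pairs = [(s[i], s[j]) for s in sequences if gap not in (s[i], s[j])]
    n = len(pairs)
    if n == 0:
        return 0.0  # assumption: no usable rows means no measurable information
    count_i = Counter(x for x, _ in pairs)
    count_j = Counter(y for _, y in pairs)
    count_ij = Counter(pairs)
    # c/n is the joint frequency; c*n/(count_i[x]*count_j[y]) is the ratio inside the log
    return sum(c / n * log(c * n / (count_i[x] * count_j[y]))
               for (x, y), c in count_ij.items())

def mi_matrix(sequences):
    """Dict mapping each column pair (i, j) with i < j to its MI."""
    width = len(sequences[0])
    return {(i, j): mutual_information(sequences, i, j)
            for i, j in combinations(range(width), 2)}

alignment = ["MTSKLG--SLKP", "MAASLA-ASLPE", "MTSKLGAASLPE"]
scores = mi_matrix(alignment)
print(round(scores[(1, 2)], 4))  # columns 1 and 2 (0-based) vary together: 0.6365
```

Each pair costs one pass over the rows, so the whole matrix is roughly quadratic in the alignment width; for wide alignments a vectorized approach may be worth considering.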

Answer 1: (score: 0)

Here's a place to start... read the comments

import numpy as np

A = [  # you'll need to pad the end of your strings so that they're all the 
       # same length for this to play nice with numpy
"MTSKLG--SLKP",
"MAASLA-ASLPE",
"MTSKLGAASLPE"]

#create an array of bytes
B = np.array([np.frombuffer(a.encode('ascii'), dtype=np.uint8) for a in A])

#create search string to do bytewise xoring
#same length as B.shape[1]
search_string = "-TS---------"  # P of T at pos 1 and S at pos 2
               #"M-----------"  # P of M at pos 0 

#take ord of each char in string
search_ord = np.frombuffer(search_string.encode('ascii'), dtype=np.uint8)

#locate positions not compared
search_mask = search_ord != ord('-')

#xor with search_ord. 0 indicates letter in that position matches
#multiply with search_mask to force uninteresting positions to 0
#any remaining arrays that are all 0 are a match. ("any()" taken along axis 1)
#this prints [False, True, False]. take the sum to get the number of non-matches
print(((B^search_ord) * search_mask).any(1))
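Building on the same byte-array idea, the joint frequency P(ij,xy) asked about in the question might be obtained by comparing two columns directly instead of XOR-ing a full search string. This is a sketch only: it assumes equal-length ASCII sequences and 0-based column indices, and `joint_frequency` is a hypothetical helper name.

```python
import numpy as np

alignment = ["MTSKLG--SLKP", "MAASLA-ASLPE", "MTSKLGAASLPE"]
# One row per sequence, one byte per residue (all rows must have equal length)
B = np.array([np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in alignment])

def joint_frequency(B, i, j, x, y):
    """Fraction of rows with residue x in column i and residue y in column j."""
    match = (B[:, i] == ord(x)) & (B[:, j] == ord(y))
    return match.mean()  # mean of a boolean array is the fraction of True entries

print(joint_frequency(B, 1, 2, "T", "S"))  # T at column 1 and S at column 2
```

For the three sequences above this gives 2/3, matching the value stated in the question.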