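Selecting elements from a non-symmetric matrix

Time: 2013-06-06 04:16:19

Tags: r

I have a list of proteins and their interactors, and I would like to know what percentage of interactors is shared between different proteins.

My list of proteins and interactors looks like this: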

head(lista)
$`A1CF `
[1] " A1CF"    " APOBEC1" " CUGBP2"  " KHSRP"   " SYNCRIP" " TNPO2"  

$`A2LD1 `
[1] " A2LD1"   " PRPSAP2" " RPL15"   " TANC1"  

$`A2M `
[1] " A2M"      " ADAM19"   " ADAMTS1"  " AMBP"     " ANXA6"    " APOE"     " APP"      "    B2M"      " C11orf58" " CELA1"    " CPB2"     " CTSB"     " CTSE"    
[14] " F2"       " HSPA5"    " IL10"     " IL1B"     " KLK13"    " KLK2"     " KLK3"     " KLK5"     " KLKB1"    " LCAT"     " LEP"      " LRP1"     " MMP2"    
[27] " MYOC"     " NGF"      " PAEP"     " PDGFA"    " PDGFB"    " PLG"      " SERPINA1" "  SHBG"     " SPACA3"   " TGFBI"   

$`AAAS `
[1] " AAAS"    " ARHGAP1" " BANF1"   " CCNG2"   " EP300"   " HMGA1"   " KPNB1"   " NUP107"  " NUP133"  " NUP153"  " NUP155"  " NUP160"  " NUP188"  " NUP205" 
[15] " NUP210"  " NUP214"  " NUP35"   " NUP37"   " NUP43"   " NUP50"   " NUP54"   " NUP62"   " NUP85"   " NUP88"   " NUP93"   " NUP98"   " NUPL1"   " NUPL2"  
[29] " PLK4"    " POM121C" " PSIP1"   " RAE1"    " RAN"     " RANBP2"  " SEH1L"   " TARDBP"  " TPR"     " TTK"     " XPO1"   

$`AAGAB `
[1] " AAGAB"  " AFTPH"  " EIF3C"  " UNC119"

$`AAK1 `
[1] " AAK1"     " ACOX3"    " ADAM28"   " ALPK3"    " AURKB"    " AZI2"     " BMP2K"    " CABC1"    " CAMK2G"   " DCK"      " DCTPP1"   " EIF2AK1"  " FAM83A"  
[14] " FER"      " FRYL"     " GAPVD1"   " GFPT1"    " HIPK1"    " JAK1"     " KIAA0195" " KIAA0528" " LIMK2"    " LSM14A"   " MAP4K2"   " MAP4K5"   " MAPK6"   
[27] " NEK11"    " NQO2"     " NUMB"     " PDE4A"    " PIP4K2C"  " PKN3"     " PRKAA1"   " PTPN18"   " SIK2"     " SIK3"     " SPEG"     " TAOK1"    " TAOK3"   
[40] " TBK1"     " TBKBP1"   " TESK2"    " TMX1"     " TNK1"     " ZAK" 

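To get the percentage of interactors shared between proteins, I did the following:

I created a square matrix whose dimensions equal the length of lista (9794 proteins):
n <- length(lista)                       # 9794 in my data
M <- matrix(NA_real_, nrow = n, ncol = n)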

#A function to calculate the interactors shared among proteins
dFun3 <- function(x,y){length(which(x%in%y))/length(x)};

#To create a matrix with the percentage of interactors shared among proteins (note that the matrix is non-symmetric: AxB differs from BxA, with A and B being proteins)

for (i in 1:length(lista))
{
    for (j in 1:length(lista))
    {
        k = dFun3(lista[[i]], lista[[j]])
        M[i,j] = k;
    }
}
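
As a small illustration of why the matrix is non-symmetric, here is dFun3 applied to two toy interactor sets. The vectors `a` and `b` are made up for this sketch and are not part of the original data:

```r
# Toy interactor sets (hypothetical, not taken from lista)
a <- c("P1", "P2", "P3", "P4")
b <- c("P1", "P2")

# Same definition as above: fraction of x's members that also occur in y
dFun3 <- function(x, y) length(which(x %in% y)) / length(x)

dFun3(a, b)  # 2 of the 4 members of a are in b -> 0.5
dFun3(b, a)  # both members of b are in a       -> 1
```

The denominator is always the length of the first argument, so swapping the arguments generally changes the result.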

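Now I have a matrix holding the comparisons AxB and BxA. What I want to do next is compare the value for protein i against the value for protein j: the idea is to compare AxB with BxA and, if AxB > 0.7 and BxA < 0.7, remove protein A. My approach is a for loop like this: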

for (i in 1:nrow(M))
{
    for (j in 1:ncol(M))
    {
        if (x[i,] > 0.7 & x[,j] < 0.7) {x[i,] <- "-1"}
        if (x[,j] > 0.7 & x[i,] <0.7) {x[,j] <- "+1"}
    }
}

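With this approach I intend to remove the proteins flagged by the +1 and -1 comparisons.

However, this approach takes a very long time... any suggestions would be very welcome.

Thanks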

1 answer:

Answer 0: (score: 2)

It looks like combn + intersect are good candidates here. Try this example:

combn(seq_along(lista),2,function(x)
         length(intersect(lista[[x[1]]],lista[[x[2]]]))/length(lista[[x[1]]]))

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  ## gives all zeros here since 
                                   ## no intersection in your example

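In effect, combn generates all possible combinations of indices and passes each one to the function as a pair of indices, which then tests the intersection.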

combn(seq_along(lista),2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,]    1    1    1    1    1    2    2    2    2     3     3     3     4     4     5
[2,]    2    3    4    5    6    3    4    5    6     4     5     6     5     6     6

Edit: adding the input for lista, since the OP did not provide a reproducible example:

dput(lista)
structure(list(A1CF = c(" A1CF", " AURKB", " CUGBP2", " KHSRP", 
" SYNCRIP", " TNPO2"), A2LD1 = c(" A2LD1", " PRPSAP2", " RPL15", 
" TANC1"), A2M = c(" A2M", " ADAM19", " ADAMTS1", " AMBP", " ANXA6", 
" APOE", " APP", ",B2M", " C11orf58", " CELA1", " CPB2", " CTSB", 
" CTSE", " F2", " HSPA5", " IL10", " IL1B", " KLK13", " KLK2", 
" KLK3", " KLK5", " KLKB1", " LCAT", " LEP", " LRP1", " MMP2", 
" MYOC", " NGF", " PAEP", " PDGFA", " PDGFB", " PLG", " SERPINA1", 
"  SHBG", " SPACA3", " TGFBI"), AAAS = c(" AAAS", " ARHGAP1", 
" BANF1", " CCNG2", " EP300", " HMGA1", " KPNB1", " NUP107", 
" NUP133", " NUP153", " NUP155", " NUP160", " NUP188", " NUP205", 
" NUP210", " NUP214", " NUP35", " NUP37", " NUP43", " NUP50", 
" NUP54", " NUP62", " NUP85", " NUP88", " NUP93", " NUP98", " NUPL1", 
" NUPL2", " PLK4", " POM121C", " PSIP1", " RAE1", " RAN", " RANBP2", 
" SEH1L", " TARDBP", " TPR", " TTK", " XPO1"), AAGAB = c(" AAGAB", 
" AFTPH", " EIF3C", " UNC119"), AAK1 = c(" AAK1", " ACOX3", " ADAM28", 
" ALPK3", " AURKB", " AZI2", " BMP2K", " CABC1", " CAMK2G", " DCK", 
" DCTPP1", " EIF2AK1", " FAM83A", " FER", " FRYL", " GAPVD1", 
" GFPT1", " HIPK1", " JAK1", " KIAA0195", " KIAA0528", " LIMK2", 
" LSM14A", " MAP4K2", " MAP4K5", " MAPK6", " NEK11", " NQO2", 
" NUMB", " PDE4A", " PIP4K2C", " PKN3", " PRKAA1", " PTPN18", 
" SIK2", " SIK3", " SPEG", " TAOK1", " TAOK3", " TBK1", " TBKBP1", 
" TESK2", " TMX1", " TNK1", " ZAK")), .Names = c("A1CF", "A2LD1", 
"A2M", "AAAS", "AAGAB", "AAK1"))

Edit

To also get the comparison in the other direction (row 2 against row 1 in the matrix), you can change the function like this:

ll <- combn(seq_along(lista),2,FUN=function(x){
  ratio <- length(intersect(lista[[x[1]]],lista[[x[2]]]))/
        c(length(lista[[x[1]]]),length(lista[[x[2]]]))
  res <- NA                             ## value to return by default
  if (ratio[1] > 0.7 & ratio[2] < 0.7) 
      res <- x[[1]]                     ## return the index of the first protein
  if (ratio[2] > 0.7 & ratio[1] < 0.7) 
      res <- x[[2]]                     ## return the index of the second protein
  res
})
## to get the names of the proteins to remove
names(lista)[ll[!is.na(ll)]]
## to remove those proteins from the original list
lista[!names(lista) %in% names(lista)[ll[!is.na(ll)]]] 

You should probably also remove the duplicates from ll.

FYI, choose(9794, 2) = 47956321 combinations....
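
A minimal sketch of that de-duplication step, assuming `ll` is an integer vector of indices with NA for pairs where nothing is flagged. The names `lista_demo`, `ll_demo`, `to_remove` and `kept` are made up for the illustration:

```r
# Hypothetical stand-ins for lista and for the ll vector returned by combn()
lista_demo <- list(A = c("A", "X"), B = c("B", "Y"), C = c("C", "X"),
                   D = "D", E = "E")
ll_demo <- c(NA, 3, NA, 3, 5, NA, 5)

# Drop the NAs, then keep each flagged index only once
to_remove <- unique(ll_demo[!is.na(ll_demo)])

# setdiff() is used instead of negative indexing so that an empty
# to_remove keeps the whole list (lista_demo[-integer(0)] would be empty)
kept <- lista_demo[setdiff(seq_along(lista_demo), to_remove)]
names(kept)
```

Indexing by position rather than by name also avoids surprises if two entries of the list happen to share a name.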
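
With that many pairs, calling an R function once per combination may itself be slow. If the full matrix M from the question has already been computed and memory permits (a 9794 x 9794 numeric matrix takes roughly 770 MB, and t(M) needs the same again), one possible alternative is to compare M with its transpose in a single vectorized step. This is only a sketch on a made-up 3 x 3 matrix; `M_demo`, `idx` and `flagged` are names introduced for the illustration:

```r
# Hypothetical 3 x 3 matrix of shared-interactor fractions;
# M_demo[i, j] plays the role of "i x j" from the question
M_demo <- matrix(c(1.0, 0.8, 0.2,
                   0.5, 1.0, 0.1,
                   0.9, 0.1, 1.0),
                 nrow = 3, byrow = TRUE)

# t(M_demo)[i, j] equals M_demo[j, i], so this flags every (i, j)
# where i x j is above 0.7 while j x i is below 0.7
idx <- which(M_demo > 0.7 & t(M_demo) < 0.7, arr.ind = TRUE)

# Row indices of the proteins to remove, each listed once
flagged <- sort(unique(idx[, "row"]))
flagged
```

Here rows 1 and 3 are flagged: M_demo[1, 2] is 0.8 against M_demo[2, 1] at 0.5, and M_demo[3, 1] is 0.9 against M_demo[1, 3] at 0.2.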