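I have a list of proteins and their interactors, and I would like to know the percentage of interactors shared between different proteins.
My list of proteins and interactors looks like this: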
head(lista)
$`A1CF `
[1] " A1CF" " APOBEC1" " CUGBP2" " KHSRP" " SYNCRIP" " TNPO2"
$`A2LD1 `
[1] " A2LD1" " PRPSAP2" " RPL15" " TANC1"
$`A2M `
[1] " A2M" " ADAM19" " ADAMTS1" " AMBP" " ANXA6" " APOE" " APP" " B2M" " C11orf58" " CELA1" " CPB2" " CTSB" " CTSE"
[14] " F2" " HSPA5" " IL10" " IL1B" " KLK13" " KLK2" " KLK3" " KLK5" " KLKB1" " LCAT" " LEP" " LRP1" " MMP2"
[27] " MYOC" " NGF" " PAEP" " PDGFA" " PDGFB" " PLG" " SERPINA1" " SHBG" " SPACA3" " TGFBI"
$`AAAS `
[1] " AAAS" " ARHGAP1" " BANF1" " CCNG2" " EP300" " HMGA1" " KPNB1" " NUP107" " NUP133" " NUP153" " NUP155" " NUP160" " NUP188" " NUP205"
[15] " NUP210" " NUP214" " NUP35" " NUP37" " NUP43" " NUP50" " NUP54" " NUP62" " NUP85" " NUP88" " NUP93" " NUP98" " NUPL1" " NUPL2"
[29] " PLK4" " POM121C" " PSIP1" " RAE1" " RAN" " RANBP2" " SEH1L" " TARDBP" " TPR" " TTK" " XPO1"
$`AAGAB `
[1] " AAGAB" " AFTPH" " EIF3C" " UNC119"
$`AAK1 `
[1] " AAK1" " ACOX3" " ADAM28" " ALPK3" " AURKB" " AZI2" " BMP2K" " CABC1" " CAMK2G" " DCK" " DCTPP1" " EIF2AK1" " FAM83A"
[14] " FER" " FRYL" " GAPVD1" " GFPT1" " HIPK1" " JAK1" " KIAA0195" " KIAA0528" " LIMK2" " LSM14A" " MAP4K2" " MAP4K5" " MAPK6"
[27] " NEK11" " NQO2" " NUMB" " PDE4A" " PIP4K2C" " PKN3" " PRKAA1" " PTPN18" " SIK2" " SIK3" " SPEG" " TAOK1" " TAOK3"
[40] " TBK1" " TBKBP1" " TESK2" " TMX1" " TNK1" " ZAK"
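To get the percentage of interactors shared between proteins, I have done the following:
I created a matrix whose dimensions equal the length of lista: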
# Empty 9794 x 9794 matrix (all NA), one row and one column per protein
M <- matrix(NA, nrow = 9794, ncol = 9794)
# A function to calculate the fraction of the interactors of x that are shared with y
dFun3 <- function(x,y){length(which(x%in%y))/length(x)};
# Create a matrix with the percentage of interactors shared among proteins (note that the matrix is non-symmetric: AxB differs from BxA, with A and B being proteins)
for (i in 1:length(lista))
{
for (j in 1:length(lista))
{
k = dFun3(lista[[i]], lista[[j]])
M[i,j] = k;
}
}
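Now I have a matrix showing the comparison between AxB and BxA. What I want to do next is compare the value for protein i with the value for protein j. The idea is to compare AxB with BxA and, if AxB > 0.7 and BxA < 0.7, remove protein A. My approach was to write a for loop like this: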
for (i in 1:nrow(M))
{
for (j in 1:ncol(M))
{
if (x[i,] > 0.7 & x[,j] < 0.7) {x[i,] <- "-1"}
if (x[,j] > 0.7 & x[i,] <0.7) {x[,j] <- "+1"}
}
}
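With this approach I intend to remove the proteins marked in the +1 and -1 comparisons.
However, this approach takes a very long time... any suggestions would be very welcome.
Thanks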
Answer 0 (score: 2)
It looks like combn + intersect are good candidates. Try this example:
combn(seq_along(lista),2,function(x)
length(intersect(lista[[x[1]]],lista[[x[2]]]))/length(lista[[x[1]]]))
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## gives all zeros here since
## no intersection in your example
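In fact, combn generates all possible combinations of indices and then passes each one, as a pair of indices, to the function that tests the intersection.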
combn(seq_along(lista),2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
[2,] 2 3 4 5 6 3 4 5 6 4 5 6 5 6 6
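Since the output shown above is all zeros, a small made-up list with a known overlap may make the behaviour easier to see. This is only an illustrative sketch: `toy`, `pairs`, `shared` and the entries `p1`, `p2`, ... are hypothetical names, not part of the original data.

```r
# Made-up list: A and B share two interactors, C shares none.
toy <- list(A = c("p1", "p2", "p3", "p4"),
            B = c("p1", "p2"),
            C = c("p9"))

# Index pairs generated by combn(): (1,2), (1,3), (2,3)
pairs <- combn(seq_along(toy), 2)

# For each pair, the fraction of the FIRST protein's interactors
# that also appear in the second protein's list.
shared <- combn(seq_along(toy), 2, function(x)
  length(intersect(toy[[x[1]]], toy[[x[2]]])) / length(toy[[x[1]]]))

shared
```

Note that only one direction (first protein relative to second) is computed per pair here, so the BxA value still has to be obtained separately.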
EDIT: adding the input for lista, since the OP did not provide a reproducible example:
dput(lista)
structure(list(A1CF = c(" A1CF", " AURKB", " CUGBP2", " KHSRP",
" SYNCRIP", " TNPO2"), A2LD1 = c(" A2LD1", " PRPSAP2", " RPL15",
" TANC1"), A2M = c(" A2M", " ADAM19", " ADAMTS1", " AMBP", " ANXA6",
" APOE", " APP", ",B2M", " C11orf58", " CELA1", " CPB2", " CTSB",
" CTSE", " F2", " HSPA5", " IL10", " IL1B", " KLK13", " KLK2",
" KLK3", " KLK5", " KLKB1", " LCAT", " LEP", " LRP1", " MMP2",
" MYOC", " NGF", " PAEP", " PDGFA", " PDGFB", " PLG", " SERPINA1",
" SHBG", " SPACA3", " TGFBI"), AAAS = c(" AAAS", " ARHGAP1",
" BANF1", " CCNG2", " EP300", " HMGA1", " KPNB1", " NUP107",
" NUP133", " NUP153", " NUP155", " NUP160", " NUP188", " NUP205",
" NUP210", " NUP214", " NUP35", " NUP37", " NUP43", " NUP50",
" NUP54", " NUP62", " NUP85", " NUP88", " NUP93", " NUP98", " NUPL1",
" NUPL2", " PLK4", " POM121C", " PSIP1", " RAE1", " RAN", " RANBP2",
" SEH1L", " TARDBP", " TPR", " TTK", " XPO1"), AAGAB = c(" AAGAB",
" AFTPH", " EIF3C", " UNC119"), AAK1 = c(" AAK1", " ACOX3", " ADAM28",
" ALPK3", " AURKB", " AZI2", " BMP2K", " CABC1", " CAMK2G", " DCK",
" DCTPP1", " EIF2AK1", " FAM83A", " FER", " FRYL", " GAPVD1",
" GFPT1", " HIPK1", " JAK1", " KIAA0195", " KIAA0528", " LIMK2",
" LSM14A", " MAP4K2", " MAP4K5", " MAPK6", " NEK11", " NQO2",
" NUMB", " PDE4A", " PIP4K2C", " PKN3", " PRKAA1", " PTPN18",
" SIK2", " SIK3", " SPEG", " TAOK1", " TAOK3", " TBK1", " TBKBP1",
" TESK2", " TMX1", " TNK1", " ZAK")), .Names = c("A1CF", "A2LD1",
"A2M", "AAAS", "AAGAB", "AAK1"))
EDIT
To look up the comparison between row2 and row1 in the matrix, you can change the function like this:
ll <- combn(seq_along(lista),2,FUN=function(x){
ratio <- length(intersect(lista[[x[1]]],lista[[x[2]]]))/
c(length(lista[[x[1]]]),length(lista[[x[2]]]))
res <- NA ## value to return by default
if (ratio[1] > 0.7 & ratio[2] < 0.7)
res <- x[[1]] ## return the index of the first protein
if (ratio[2] > 0.7 & ratio[1] < 0.7)
res <- x[[2]] ## return the index of the second protein
res
})
## to get the names of the proteins to remove
names(lista)[ll[!is.na(ll)]]
## to remove those proteins from the original list
lista[!names(lista) %in% names(lista)[ll[!is.na(ll)]]]
You may also want to remove the duplicates in ll, since the same protein can be flagged by more than one pair.
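A minimal sketch of that clean-up, assuming `ll` is the vector returned by the combn call above. A toy stand-in is used here so the snippet runs on its own, and `to_remove` is a name introduced only for illustration.

```r
# Toy stand-in for the combn() result: NA means "nothing to remove for
# this pair", a number is the index of the protein flagged for removal.
ll <- c(NA, 3, NA, 1, 3, NA, 1)

# Drop the NAs, then keep each flagged index only once.
to_remove <- unique(ll[!is.na(ll)])
to_remove
```

With the real data, `names(lista)[to_remove]` would then give each flagged protein name once.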
FYI: 47956321 = choose(9794, 2) is the number of combinations....
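With that many pairs, a loop-free formulation may be worth trying. The sketch below is an alternative that is not part of the answer above: it builds a 0/1 interactor-by-protein membership matrix and obtains all pairwise shared counts with a single crossprod, then applies the 0.7 rule from the question by comparing the matrix with its transpose. The list `toy` and the names `memb`, `shared`, `flag` and `to_remove` are made up for illustration. It also assumes each protein's interactor list has no duplicates and that a dense matrix fits in memory, which may not hold for the full list.

```r
# Made-up list standing in for lista.
toy <- list(A = c("p1", "p2", "p3", "p4", "p5"),
            B = c("p1", "p2", "p3"),
            C = c("p9"))

# Long table of (interactor, protein) pairs, then a 0/1 membership
# matrix with interactors in rows and proteins in columns.
long <- stack(toy)                                   # columns: values, ind
memb <- (unclass(table(long$values, long$ind)) > 0) * 1

# shared[i, j] = number of interactors proteins i and j have in common;
# the diagonal holds each protein's own interactor count.
shared <- crossprod(memb)

# M[i, j] = fraction of protein i's interactors also found for protein j
# (each row i is divided by diag(shared)[i]).
M <- shared / diag(shared)

# Flag protein i when M[i, j] > 0.7 but M[j, i] < 0.7 for some j.
flag <- M > 0.7 & t(M) < 0.7
to_remove <- rownames(M)[rowSums(flag) > 0]
to_remove
```

Here B is flagged because all of its interactors appear in A's list, while only 60% of A's appear in B's.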