对于列表中的一组变量的每个组合,计算该组合与R中的另一个变量之间的相关性

时间:2018-06-01 09:12:58

标签: r dataframe bioinformatics combinatorics

在R中,我想通过比较2个变量同时保留系统发育信号来产生相关系数。

我认为这样做的最初方式不是计算效率,我认为有一个更简单,但我没有R的技能来做它。

我有一个csv文件,如下所示:

+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
|            Species            | OGT |  Domain  |       A       |      C       |      D       |      E       |      F       |      G       |      H       |      I       |      K       |       L       |      M       |      N       |      P       |      Q       |      R       |      S       |      T       |      V       |      W       |      Y       |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix              |  95 | Archaea  |  9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 |   2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 |  6.266067034 | 4.2052190807 | 9.2692433532 |  1.318690698 | 3.5614200159 |
| Argobacterium fabrum          |  26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 |  9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans |  27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 |  2.940143349 | 2.3473650439 |  10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus              |  85 | Bacteria |  5.8730327277 |  0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus        |  83 | Archaea  |  7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 |  4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 |  9.4826333048 | 2.6014466253 |  3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+

我想要做的是,对于20个单字母列中的百分比的每种可能组合(氨基酸,所以1000万种组合)。是计算CSV中每个不同组合和OGT变量之间的相关性....(同时保留系统发育信号)

我目前的代码是:

library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)



taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")

zeldodata <- read.csv("COMPLETECOPYFORR.csv")

Species <- dput(zeldodata)

SpeciesLong <-
  Species %>%
  gather(protein, proportion,
         A:Y) %>%
  arrange(Species)


S <- unique(SpeciesLong$protein)


Scombi <- unlist(lapply(seq_along(S),
                        function(x) combn(S, x, FUN = paste0, collapse = "")))


joint_protein <- function(protein_combo, data){
  sum(data$proportion[vapply(data$protein,
                             grepl,
                             logical(1),
                             protein_combo)])
}


SplitSpecies <-
  split(SpeciesLong,
        SpeciesLong$Species)

cl <- makeCluster(detectCores() - 1)

clusterExport(cl, c("Scombi", "joint_protein"))


SpeciesAggregate <-
  parLapply(cl,
            X = SplitSpecies,
            fun = function(data){
              X <- lapply(Scombi,
                          joint_protein,
                          data)
              names(X) <- Scombi
              as.data.frame(X)
            })
Species <- cbind(Species, SpeciesAggregate)

`

尝试将每种组合物加入到记忆中,然后计算每种酸的每种比例的总和,但这需要永远完成并在完成前崩溃。

我认为最好将相关联合效应转化为载体,然后打印出每个物种的每种不同组合的相对系数,但我不知道这样做的最佳方法在R。

我还打算使用ape包保留一个系统发育信号,使用类似的东西:

pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
    data = zeldodata, method = "ML")
summary(pglsModel)

道歉,如果有人有任何建议,这是多么不清楚,非常感谢!

修改:Link to taxonomyforzeldospecies.nex

dput(Zeldodata)的输出:

  1              Species              OGT    Domain          A              C              D              E              F              G              H              I              K               L              M              N              P              Q              R              S              T              V              W              Y        
     ------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- 
  2   Aeropyrum pernix                 95   Archaea     9.7659115711   0.6720465616   4.3895390781   7.6501943794   2.9344881615   8.8666657183   1.5011817208   5.6901432494   4.1428307243   11.0604191603     2.21143353   1.9387130928   5.1038552753   1.6855017182   7.7664358772    6.266067034   4.2052190807   9.2692433532    1.318690698   3.5614200159  
  3   Argobacterium fabrum             26   Bacteria   11.5698896021   0.7985475923   5.5884500155   5.8165463343   4.0512504104   8.2643271309   2.0116736244   5.7962804605   3.8931525401    9.9250463349   2.5980609708   2.9846761128   4.7828063605   3.1262365491   6.5684282943   5.9454781844   5.3740045968   7.3382308193   1.2519739683   2.3149400984  
  4   Anaeromyxobacter dehalogenans    27   Bacteria   16.0337898849   0.8860252895   5.1368827707   6.1864992608   2.9730203513   9.3167603253   1.9360386851    2.940143349   2.3473650439    10.898494736   1.6343905351   1.5247123262   6.3580285706   2.4715303021   9.2639057482   4.1890063803   4.3992339725   8.3885969061   1.2890166336   1.8265589289  
  5   Aquifex aeolicus                 85   Bacteria    5.8730327277    0.795341216   4.3287799008   9.6746388172   5.1386954322   6.7148035486   1.5438364179   7.3358775924   9.4641440609   10.5736658776   1.9263080969   3.6183861236   4.0518679067   2.0493569604   4.9229955632   4.7976564501   4.2005259246   7.9169763709   0.9292167138   4.1438942987  
  6   Archaeoglobus fulgidus           83   Archaea     7.8742687687   1.1695110027   4.9165979364   8.9548767369    4.568636662   7.2640358917   1.4998752909   7.2472039919   6.8957233203    9.4826333048   2.6014466253    3.206476915   3.8419576418   1.7789787933   5.7572748236   5.4763351139   4.1490633048   8.6330814159   1.0325605451   3.6494619148  

1 个答案:

答案 0 :(得分:1)

这将为您提供一个长数据框,每个组合和每个Species的总和(在我的机器上大约需要35秒)......

zeldodata <- 
  Species %>%
  gather(protein, proportion, A:Y) %>%
  group_by(Species) %>% 
  mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>% 
  mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>% 
  unnest() %>% 
  select(-protein, -proportion)

分别计算每个物种并将数据保存到磁盘然后读取每个物种并将它们组合的示例...

library(readr)
library(dplyr)
library(tidyr)
library(purrr)

# read in CSV file
zeldodata <-
  read_delim(
    delim = "|",
    trim_ws = TRUE,
    col_names = TRUE,
    col_types = "cicdddddddddddddddddddd", 
    file = "Species                       | OGT |  Domain  |       A       |      C       |      D       |      E       |      F       |      G       |      H       |      I       |      K       |       L       |      M       |      N       |      P       |      Q       |      R       |      S       |      T       |      V       |      W       |      Y
    Aeropyrum pernix              |  95 | Archaea  |  9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 |   2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 |  6.266067034 | 4.2052190807 | 9.2692433532 |  1.318690698 | 3.5614200159
    Argobacterium fabrum          |  26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 |  9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
    Anaeromyxobacter dehalogenans |  27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 |  2.940143349 | 2.3473650439 |  10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
    Aquifex aeolicus              |  85 | Bacteria |  5.8730327277 |  0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
    Archaeoglobus fulgidus        |  83 | Archaea  |  7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 |  4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 |  9.4826333048 | 2.6014466253 |  3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
  )

# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
  zeldodata %>%
    filter(Species == species) %>% 
    gather(protein, proportion, A:Y) %>%
    mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>% 
    mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>% 
    unnest() %>% 
    select(-protein, -proportion) %>% 
    saveRDS(file = paste0(species, ".RDS"))
}

# read in and combine all the RDS files
zeldodata <- 
  list.files(pattern = "\\.RDS") %>%
  map(read_rds) %>% 
  bind_rows()