Question

I have not found any post that solves my problem.

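I have a two-column data frame. Specifically, it has two factors with 11985 and 20200 levels, respectively. Combining the levels of the two factors gives 849472 observations in total. Here is a sample of the data frame: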

Category    Gene
BP0000      Fp91000
BP0001      Fp82000
BP0002      Fp70000
BP0010      Fp72000
BP0021      Fp30000
BP0021      Fp30020 
BP0001      Fp30000
BP0000      Fp82000

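I want to keep the original Category column with each level appearing only once, and in the other column I want all genes that match a category collected in the same cell. This is the format I want: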

Category    Gene
BP0000      Fp91000 Fp82000
BP0001      Fp82000 Fp30000
BP0002      Fp70000
BP0010      Fp72000
BP0021      Fp30000 Fp30020

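I have tried match, but I only get one match for the Gene column even though there are multiple matches. I apologize if this has already been posted in another question, but I did not see anything like it.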

Answer 1

Here is a solution using dplyr:

library(dplyr)

df <- data.frame(category = c("a", "a", "a", "b", "b", "b"),
                 value = c("c", "d", "e", "f", "g", "h"),
                 stringsAsFactors = FALSE)

df_out <- df %>%
  group_by(category) %>%
  mutate(value = paste(value, collapse=" ")) %>%
  unique()

Edit: for large data frames, unique() is very slow. This works better:

df_out <- df %>%
  group_by(category) %>%
  mutate(value = paste(value, collapse=" ")) %>%
  group_by(category, value) %>%
  summarise()
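As a possible simplification (not part of the original answer), the mutate-then-deduplicate steps can be collapsed into a single summarise() call, which returns one row per group directly. A minimal sketch, using the same toy data frame as above:

```r
library(dplyr)

# Same toy data as in the answer above
df <- data.frame(category = c("a", "a", "a", "b", "b", "b"),
                 value = c("c", "d", "e", "f", "g", "h"),
                 stringsAsFactors = FALSE)

# summarise() yields one row per category, so no deduplication step is needed
df_out <- df %>%
  group_by(category) %>%
  summarise(value = paste(value, collapse = " "))
```

Note that summarise() returns the groups sorted by category, which may differ from their order of first appearance in the input.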

Answer 2

Let df be your data frame. You may want to try:

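getme <- function(x) {
  # Paste all genes that belong to category x into one space-separated string
  paste(df$Gene[df$Category == x], collapse = " ")
}

# Call getme once per distinct category so that rows stay aligned even
# when two categories happen to share exactly the same set of genes
cats <- unique(as.character(df$Category))
final <- data.frame(Category = cats,
                    Gene = vapply(cats, getme, character(1), USE.NAMES = FALSE),
                    stringsAsFactors = FALSE)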

final is the data frame you expect.
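For comparison, base R's aggregate() can do the same grouping in one call. This is only a sketch on the sample rows from the question; the name collapsed is illustrative, and the approach has not been timed on the full 849472-row data.

```r
# Sample rows from the question, stored as character columns
df <- data.frame(
  Category = c("BP0000", "BP0001", "BP0002", "BP0010",
               "BP0021", "BP0021", "BP0001", "BP0000"),
  Gene = c("Fp91000", "Fp82000", "Fp70000", "Fp72000",
           "Fp30000", "Fp30020", "Fp30000", "Fp82000"),
  stringsAsFactors = FALSE
)

# Extra arguments (here collapse) are passed on to FUN, so each
# category's genes are pasted into a single space-separated string
collapsed <- aggregate(Gene ~ Category, data = df, FUN = paste, collapse = " ")
print(collapsed)
```

The result has one row per category, sorted by Category.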

Answer 3

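To extend the tidyverse options (tidyr and dplyr) with the purrr package, you can store the genes as a list column, with one entry per category. That column can then be used for further manipulation.

Note: I have stored Gene and Category as character rather than factor, since factors do not seem efficient for such a large dataset.

Stored as a list column; for convenience I have also added a count of the number of genes: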

library(tidyverse)
dataLC <- data %>% 
  tidyr::nest(Gene, .key=GeneListCol) %>% 
  mutate(n_genes = map_int(GeneListCol, ~max(row_number(.$Gene))))

# A tibble: 5 x 3
  Category      GeneListCol n_genes
     <chr>           <list>   <int>
1   BP0000 <tibble [2 x 1]>       2
2   BP0001 <tibble [2 x 1]>       2
3   BP0002 <tibble [1 x 1]>       1
4   BP0010 <tibble [1 x 1]>       1
5   BP0021 <tibble [2 x 1]>       2
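The nest() call above uses an older tidyr interface. In tidyr 1.0.0 and later, the new column is named directly in the call. A sketch of the equivalent step, assuming a recent tidyr and using nrow() for the gene count in place of the row_number() expression:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample rows from the question
data <- tibble(
  Category = c("BP0000", "BP0001", "BP0002", "BP0010",
               "BP0021", "BP0021", "BP0001", "BP0000"),
  Gene = c("Fp91000", "Fp82000", "Fp70000", "Fp72000",
           "Fp30000", "Fp30020", "Fp30000", "Fp82000")
)

# nest(new_col = cols): Gene is packed into a list column of tibbles,
# one per distinct value of the remaining column (Category)
dataLC <- data %>%
  nest(GeneListCol = Gene) %>%
  mutate(n_genes = map_int(GeneListCol, nrow))
```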

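This can be used as is with purrr functions, and is probably most useful in this form.

To extract the genes of a selected category as a vector, which seems to be the most useful output, you can do the following: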

map(dataLC$GeneListCol, "Gene")[dataLC$Category=="BP0001"][[1]]
[1] "Fp82000" "Fp30000"
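If only this per-category lookup is needed, a named list built with base R's split() gives similar access without a list column. A minimal sketch on the sample rows from the question; the name genes_by_category is illustrative:

```r
# Sample rows from the question, stored as character columns
data <- data.frame(
  Category = c("BP0000", "BP0001", "BP0002", "BP0010",
               "BP0021", "BP0021", "BP0001", "BP0000"),
  Gene = c("Fp91000", "Fp82000", "Fp70000", "Fp72000",
           "Fp30000", "Fp30020", "Fp30000", "Fp82000"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of genes per category
genes_by_category <- split(data$Gene, data$Category)

# Genes of a single category, as a character vector
genes_by_category[["BP0001"]]

# Number of genes in each category
lengths(genes_by_category)
```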

To get a single string containing all the genes (note that this is not "wide format"), do the following:

dataLC %>% 
  mutate(geneList = map_chr(GeneListCol, ~paste(.$Gene, collapse =" "))) %>% 
  select(-GeneListCol)
# A tibble: 5 x 3
  Category n_genes        geneList
     <chr>   <int>           <chr>
1   BP0000       2 Fp91000 Fp82000
2   BP0001       2 Fp82000 Fp30000
3   BP0002       1         Fp70000
4   BP0010       1         Fp72000
5   BP0021       2 Fp30000 Fp30020

It takes a while to get used to list columns and to working on them with purrr's map functions, but it can be very useful. See the tutorial at https://jennybc.github.io/purrr-tutorial/index.html

Convert a long-format data frame to wide format, but keep the number of columns, in R