Long-to-wide format aggregation in R tidyverse

Time: 2019-12-13 13:55:54

Tags: r tidyverse

Hello, given the following data frame:

library(tidyverse)

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))
> df
  READS  GENE  COMMENT
1 READa GENEa CommentA
2 READa GENEa CommentA
3 READa GENEa CommentA
4 READb GENEb CommentA
5 READb GENEb CommentA
6 READb GENEb CommentA
7 READc GENEc CommentA
8 READc GENEc CommentA
9 READc GENEc CommentA

I would like to convert it from long to wide format, aggregating by the GENE column, to obtain the following:

         GENEa   GENEb  GENEc
READSa     3        3     3 
READSb     3        3     3

I tried this, without success:

 library(tidyverse)
      df %>% 
      group_by(GENE) %>% 
      select(-COMMENT) %>%
      spread(READS) 

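Note that the original data frame is large, so any optimized code would help.

Thank you for your help.

5 answers:

Answer 0 (score: 2)

I'm not sure how you get a count of 3 for the GENEa/READSb cell, but assuming counts are what you want, you could try the following: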


library(tidyverse)

df <- tibble(
  READS = rep(c("READa", "READb", "READc"), each = 3), 
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3), 
  COMMENT = rep(c("CommentA", "CommentA", "CommentA"), each = 3)
)
df
#> # A tibble: 9 x 3
#>   READS GENE  COMMENT 
#>   <chr> <chr> <chr>   
#> 1 READa GENEa CommentA
#> 2 READa GENEa CommentA
#> 3 READa GENEa CommentA
#> 4 READb GENEb CommentA
#> 5 READb GENEb CommentA
#> 6 READb GENEb CommentA
#> 7 READc GENEc CommentA
#> 8 READc GENEc CommentA
#> 9 READc GENEc CommentA

df %>%
  count(READS, GENE) %>%
  pivot_wider(
    names_from = GENE, values_from = n,
    values_fill = list(n = 0)
  )
#> # A tibble: 3 x 4
#>   READS GENEa GENEb GENEc
#>   <chr> <int> <int> <int>
#> 1 READa     3     0     0
#> 2 READb     0     3     0
#> 3 READc     0     0     3

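Created on 2019-12-13 by the reprex package (v0.3.0)

Answer 1 (score: 2)

Assuming the number in each output cell should be the number of input rows that have that cell's row and column names, this is a one-liner in base R.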

table(df[1:2])

giving this object of class table:

       GENE
READS   GENEa GENEb GENEc
  READa     3     0     0
  READb     0     3     0
  READc     0     0     3

If you want the result as a data frame, then:

as.data.frame.matrix(table(df[1:2]))
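Note that as.data.frame.matrix keeps the READS values as row names rather than as a column. If an ordinary READS column is needed, one possible sketch follows. It assumes the same example df as in the question; the name wide is only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  READS = rep(c("READa", "READb", "READc"), each = 3),
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3),
  COMMENT = "CommentA"
)

# Cross-tabulate READS against GENE, then convert the table to a data frame
wide <- as.data.frame.matrix(table(df[1:2]))

# Move the row names into a regular READS column and reset the row names
wide <- data.frame(READS = rownames(wide), wide, row.names = NULL)
wide
```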

Answer 2 (score: 1)

library(tidyr) #v1.0.0
pivot_wider(df, -COMMENT, names_from = GENE, values_from = GENE, 
                          values_fn = list(GENE = length), values_fill = list(GENE=0))

# A tibble: 3 x 4
  READS GENEa GENEb GENEc
  <fct> <int> <int> <int>
1 READa     3     0     0
2 READb     0     3     0
3 READc     0     0     3

Answer 3 (score: 1)

An option with dcast:
library(data.table)
dcast(setDT(df), READS ~ GENE, length)
#   READS GENEa GENEb GENEc
#1: READa     3     0     0
#2: READb     0     3     0
#3: READc     0     0     3
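Two details of the call above may be worth spelling out: setDT converts df to a data.table by reference, and dcast has to guess which column to aggregate when value.var is not given. A slightly more explicit sketch, assuming the same example data (dt and wide are illustrative names), could look like this:

```r
library(data.table)

# Rebuild the example data from the question
df <- data.frame(
  READS = rep(c("READa", "READb", "READc"), each = 3),
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3),
  COMMENT = "CommentA"
)

# as.data.table() returns a copy, so df itself stays a plain data.frame
dt <- as.data.table(df)

# Count rows per READS/GENE pair. value.var only names the column whose
# entries are handed to length(), so COMMENT is an arbitrary choice here.
wide <- dcast(dt, READS ~ GENE, fun.aggregate = length, value.var = "COMMENT")
wide
```

Naming value.var explicitly avoids the message dcast prints when it has to pick a column itself.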

Answer 4 (score: 0)

Given that some of the combinations in your desired output do not exist in the data:

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))

df %>%
  group_by(READS, GENE) %>% 
  summarise(count = n()) %>% 
  spread(key = "GENE", value = "count") 

results in

  READS GENEa GENEb GENEc
1 READa     3    NA    NA
2 READb    NA     3    NA
3 READc    NA    NA     3

Note that spread is no longer recommended; in newer versions of tidyr you should use pivot_wider instead.
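For reference, the same pipeline could be translated to pivot_wider roughly as follows. This is a sketch assuming tidyr 1.0.0 or later; values_fill replaces the NA cells with 0, and wide is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  READS = rep(c("READa", "READb", "READc"), each = 3),
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3),
  COMMENT = "CommentA"
)

wide <- df %>%
  group_by(READS, GENE) %>%
  summarise(count = n()) %>%   # one row per READS/GENE pair
  ungroup() %>%
  pivot_wider(
    names_from = GENE, values_from = count,
    values_fill = list(count = 0L)  # missing combinations become 0, not NA
  )
wide
```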