I have three vectors:
position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")
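I want to create a table like the one below, sorted from the smallest position upward. For each position, I want the number of occurrences of each my_string_allele value, with the matching position_ref values listed next to it in a position_ref column. What is the simplest way to do this?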
position  T>A  position_ref  G>C  position_ref  C>A  position_ref  A>G  position_ref  G>T  position_ref
6                                                                  1    2222
13        2    12006, 1108                                                            1    4434
20                                              1    1970
24                           1    13807                            1    9030
Answer 0 (score: 2)
Here is an approach that uses spread() to reshape the data into wide format and mutate_all() to count the occurrences.
Data
library(tidyverse)
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
Code
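df %>%
  group_by(position, my_string_allele) %>%
  # Collapse all references of one (position, allele) pair into a single string
  mutate(position_ref = paste(position_ref, collapse = ", ")) %>%
  distinct() %>%
  spread(my_string_allele, position_ref) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list of lambdas gives the same *_N columns
  mutate_all(list(N = ~ if_else(is.na(.), NA_integer_, lengths(str_split(., ", ")))))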
Output
  position `A>G` `C>A` `G>C` `G>T` `T>A`       `A>G_N` `C>A_N` `G>C_N` `G>T_N` `T>A_N`
     <dbl> <chr> <chr> <chr> <chr> <chr>         <int>   <int>   <int>   <int>   <int>
1        6 2222  NA    NA    NA    NA                1      NA      NA      NA      NA
2       13 NA    NA    NA    4434  12006, 1108      NA      NA      NA       1       2
3       20 NA    1970  NA    NA    NA               NA       1      NA      NA      NA
4       24 9030  NA    13807 NA    NA                1      NA       1      NA      NA
(You can reorder the columns by name to get the layout shown in the question.)
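The reordering can be sketched in base R as follows. This is only an illustration: `res` is a hypothetical name for the wide result, only two of the five alleles are reproduced to keep it short, and the allele order is assumed to follow the question.

```r
# Hypothetical stand-in for part of the wide result (two alleles only).
res <- data.frame(
  position = c(6, 13, 20, 24),
  `A>G`    = c("2222", NA, NA, "9030"),
  `T>A`    = c(NA, "12006, 1108", NA, NA),
  `A>G_N`  = c(1L, NA, NA, 1L),
  `T>A_N`  = c(NA, 2L, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Allele order wanted in the final table (assumed: as in the question).
alleles <- c("T>A", "A>G")

# rbind() stacks the count names over the allele names; as.vector() reads the
# matrix column by column, which interleaves them: T>A_N, T>A, A>G_N, A>G.
ordered_cols <- c("position", as.vector(rbind(paste0(alleles, "_N"), alleles)))
res <- res[, ordered_cols]
names(res)
```

Each count column now sits directly before its reference column, as in the table the question asks for.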
Answer 1 (score: 2)
Full disclosure: I am modifying part of @DarrenTsai's answer to add the number of occurrences, since his answer did not include it.
Using data.table:
library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
setDT(df)
# Prefix each collapsed reference string with its group size (.N)
df[, `:=`(position_ref = paste(.N, paste(position_ref, collapse = ", "))),
   by = c("position", "my_string_allele")] %>%
  unique(., by = c("position", "my_string_allele", "position_ref")) %>%
  dcast(position ~ my_string_allele, value.var = "position_ref")
Result:
   position    A>G    C>A     G>C    G>T           T>A
1:        6 1 2222   <NA>    <NA>   <NA>          <NA>
2:       13   <NA>   <NA>    <NA> 1 4434 2 12006, 1108
3:       20   <NA> 1 1970    <NA>   <NA>          <NA>
4:       24 1 9030   <NA> 1 13807   <NA>          <NA>
Using dplyr (this is mostly based on @DarrenTsai's answer, so please upvote his as well):
library(dplyr)
df %>%
  group_by(position, my_string_allele) %>%
  mutate(position_ref = paste(n(), paste(position_ref, collapse = ", "))) %>%
  distinct() %>%
  tidyr::spread(my_string_allele, position_ref)
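For comparison, the same two quantities can be computed without any packages. This is a minimal base R sketch and is not part of either answer; `counts` and `refs` are names chosen here for illustration. It leaves the counts and the collapsed references as two separate matrices rather than one interleaved table.

```r
position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")

# Occurrences per (position, allele) pair; rows are sorted by position.
counts <- table(position, my_string_allele)

# Collapsed references per pair; pairs that never occur are NA.
refs <- tapply(position_ref, list(position, my_string_allele),
               paste, collapse = ", ")

counts["13", "T>A"]  # number of T>A entries at position 13
refs["13", "T>A"]    # their references as one comma-separated string
```

Both matrices share the same row and column names, so they can be indexed together or bound side by side with cbind() if a single table is needed.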