What do I have? Fictional data, a reproducible set:
signal1 <- c(rep(1:6))
signal2 <- c(rep(7:12))
signal3 <- c(rep(13:18))
signal4 <- c(rep(19:24))
tag <- c('str1','str2','str3','str4','str5','str6')
gene <- c('ABC','ABC','ABC;DEF','ABC;DEF','DEF','DEF')
df <- data.frame(signal1,signal2,signal3,signal4,tag,gene)
df
signal1 signal2 signal3 signal4 tag gene
1 7 13 19 str1 ABC
2 8 14 20 str2 ABC
3 9 15 21 str3 ABC;DEF
4 10 16 22 str4 ABC;DEF
5 11 17 23 str5 DEF
6 12 18 24 str6 DEF
What do I want?
First, duplicate the rows of df in which a semicolon is present in the gene column.
signal1 signal2 signal3 signal4 tag gene
1 7 13 19 str1 ABC
2 8 14 20 str2 ABC
3 9 15 21 str3 ABC;DEF
3 9 15 21 str3 ABC;DEF
4 10 16 22 str4 ABC;DEF
4 10 16 22 str4 ABC;DEF
5 11 17 23 str5 DEF
6 12 18 24 str6 DEF
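The order of the rows after duplication doesn't matter. They can be appended at the end of df.
On top of that, remove the unnecessary genes from the rows:
As you can see, I'd like to have unambiguous gene groups without any overlap. If a tag is present in two or more genes, an additional row is needed for each gene!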
signal1 signal2 signal3 signal4 tag gene
1 7 13 19 str1 ABC
2 8 14 20 str2 ABC
3 9 15 21 str3 ABC
3 9 15 21 str3 DEF
4 10 16 22 str4 ABC
4 10 16 22 str4 DEF
5 11 17 23 str5 DEF
6 12 18 24 str6 DEF
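Here is my attempt, but unfortunately it doesn't work properly. What's more, it could only handle two genes separated by a semicolon. It won't work in the case of GENE1;GENE2;GENE3 or more.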
library(stringr)
df_tmp <- df
sapply(1:nrow(df_tmp), function(x) ifelse(str_detect(as.character(df_tmp[x,22]), ';'), df <- rbind(df_tmp, df_tmp[x,22]), df_tmp[x,22]))
Could you give me a hint on how to do this?
Answer 0: (score: 2)
We can use strsplit and tidyr::unnest:
library(tidyverse);
df %>%
mutate(gene = strsplit(as.character(gene), ";")) %>%
unnest(gene) # name the list column explicitly (tidyr >= 1.0 warns if it is omitted)
# signal1 signal2 signal3 signal4 tag gene
#1 1 7 13 19 str1 ABC
#2 2 8 14 20 str2 ABC
#3 3 9 15 21 str3 ABC
#4 3 9 15 21 str3 DEF
#5 4 10 16 22 str4 ABC
#6 4 10 16 22 str4 DEF
#7 5 11 17 23 str5 DEF
#8 6 12 18 24 str6 DEF
Explanation: strsplit splits each entry of the gene column on ";" and stores the pieces in a list column, which tidyr::unnest then expands into one row per gene.
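To make the intermediate list structure visible, here is a minimal base R sketch. The gene vector below is illustrative and not taken from the data frame above.

```r
# Illustrative input (an assumption, not the original data)
gene <- c("ABC", "ABC;DEF", "DEF;GHI;JKL")

# strsplit() returns a list with one character vector per input element
pieces <- strsplit(gene, ";", fixed = TRUE)
str(pieces)

# lengths() counts the genes per entry, i.e. how many rows each
# original row turns into once the list is expanded
n_per_row <- lengths(pieces)
n_per_row
# 1 2 3
```

Each list element becomes as many rows as it has pieces, which is why the approach is not limited to two genes per entry.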
Example with more than two semicolon-separated entries:
df <- structure(list(signal1 = 1:6, signal2 = 7:12, signal3 = 13:18,
signal4 = 19:24, tag = structure(1:6, .Label = c("str1",
"str2", "str3", "str4", "str5", "str6"), class = "factor"),
gene = structure(c(1L, 1L, 2L, 2L, 3L, 4L), .Label = c("ABC",
"ABC;DEF", "DEF", "DEF;GHI;JKL"), class = "factor")), .Names = c("signal1",
"signal2", "signal3", "signal4", "tag", "gene"), row.names = c(NA,
-6L), class = "data.frame");
df;
# signal1 signal2 signal3 signal4 tag gene
#1 1 7 13 19 str1 ABC
#2 2 8 14 20 str2 ABC
#3 3 9 15 21 str3 ABC;DEF
#4 4 10 16 22 str4 ABC;DEF
#5 5 11 17 23 str5 DEF
#6 6 12 18 24 str6 DEF;GHI;JKL
library(tidyverse);
df %>%
mutate(gene = strsplit(as.character(gene), ";")) %>%
unnest(gene) # name the list column explicitly (tidyr >= 1.0 warns if it is omitted)
# signal1 signal2 signal3 signal4 tag gene
#1 1 7 13 19 str1 ABC
#2 2 8 14 20 str2 ABC
#3 3 9 15 21 str3 ABC
#4 3 9 15 21 str3 DEF
#5 4 10 16 22 str4 ABC
#6 4 10 16 22 str4 DEF
#7 5 11 17 23 str5 DEF
#8 6 12 18 24 str6 DEF
#9 6 12 18 24 str6 GHI
#10 6 12 18 24 str6 JKL
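Answer 1: (score: 2)
A base R way could be to split the column, create an index based on the length of each element, subset the data frame with that index, and update the gene column: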
splt = strsplit(as.character(df$gene), ";") # as.character() in case gene is a factor
idx = rep(seq_len(nrow(df)), lengths(splt))
df = df[idx,]
df$gene = unlist(splt)
rownames(df) = NULL # clean up duplicated row names
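The same steps can be wrapped in a small function. This is only a sketch: the name split_rows and its arguments are hypothetical, and the data frame is a trimmed-down version of the example data (signal2 to signal4 are left out for brevity).

```r
# Hypothetical helper; the name and arguments are illustrative only
split_rows <- function(data, col, sep = ";") {
  # One character vector of pieces per row
  pieces <- strsplit(as.character(data[[col]]), sep, fixed = TRUE)
  # Repeat each row index once per piece
  idx <- rep(seq_len(nrow(data)), lengths(pieces))
  out <- data[idx, , drop = FALSE]
  # Overwrite the combined entries with the single genes
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL # clean up duplicated row names
  out
}

df <- data.frame(
  signal1 = 1:6,
  tag = paste0("str", 1:6),
  gene = c("ABC", "ABC", "ABC;DEF", "ABC;DEF", "DEF", "DEF;GHI;JKL"),
  stringsAsFactors = FALSE
)

res <- split_rows(df, "gene")
nrow(res)
# 10
res$gene
# "ABC" "ABC" "ABC" "DEF" "ABC" "DEF" "DEF" "DEF" "GHI" "JKL"
```

Because the index is built from lengths(pieces), entries with any number of semicolon-separated genes are handled the same way.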