How can I compare two datasets, df1 and df2, by gene name, extract the corresponding values for each gene name from df2, and insert them into df1?
df1 <-
Genes sample.ID chrom loc.start loc.end num.mark
Klri2 LO.WGS 1 3010000 173490000 8430
Rrs1 LO.WGS 1 3010000 173490000 8430
Serpin LO.WGS 1 3010000 173490000 8430
Myoc LO.WGS 1 3010000 173490000 8430
St18 LO.WGS 1 3010000 173490000 8430
df2 <-
RL pValue. chr start end CNA Genes
2 2.594433 1 129740006 129780779 gain Klri2
2 3.941399 1 130080653 130380997 gain Serpin,St18,Myoc
df3<-
Genes sample.ID chrom loc.start loc.end num.mark RL pValue CNA
Klri2 LO.WGS 1 3010000 173490000 8430 2 2.594433 gain
Rrs1 LO.WGS 1 3010000 173490000 8430 0 0 0
Serpin LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
Myoc LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
St18 LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
Answer 0 (score: 5)
You could try:
library(splitstackshape)
out <- cSplit(df2, "Genes", sep = ",", direction = "long")
This reshapes df2 into the right format (one row per gene):
# RL pValue. chr start end CNA Genes
#1: 2 2.594433 1 129740006 129780779 gain Klri2
#2: 2 3.941399 1 130080653 130380997 gain Serpin
#3: 2 3.941399 1 130080653 130380997 gain St18
#4: 2 3.941399 1 130080653 130380997 gain Myoc
Then you simply have to use merge(), or left_join() from dplyr:
library(dplyr)
df3 <- left_join(df1, out)
If you want to replace NA with 0, you can do the following:
df3 <- left_join(df1, out) %>% mutate_each(funs(ifelse(is.na(.), 0, .)))
Or, if you prefer subsetting:
df3 <- left_join(df1, out) %>% (function(x) { x[is.na(x)] <- 0; x })
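In recent dplyr releases, mutate_each() and funs() are deprecated, so the NA replacement above may warn or fail depending on the installed version. Below is a rough sketch of the same idea with newer verbs. It assumes dplyr (>= 1.0) and tidyr are installed, swaps cSplit() for tidyr::separate_rows(), and retypes the example data from the question, with `pValue.` shortened to `pValue` and only RL, pValue and CNA carried over to match the desired df3.

```r
library(dplyr)
library(tidyr)

# Example data retyped from the question (column `pValue.` renamed to `pValue`)
df1 <- data.frame(
  Genes     = c("Klri2", "Rrs1", "Serpin", "Myoc", "St18"),
  sample.ID = "LO.WGS",
  chrom     = 1,
  loc.start = 3010000,
  loc.end   = 173490000,
  num.mark  = 8430,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  RL     = c(2, 2),
  pValue = c(2.594433, 3.941399),
  CNA    = c("gain", "gain"),
  Genes  = c("Klri2", "Serpin,St18,Myoc"),
  stringsAsFactors = FALSE
)

# One row per gene instead of comma-separated gene lists
df2_long <- df2 %>%
  separate_rows(Genes, sep = ",") %>%
  select(Genes, RL, pValue, CNA)

# Left join keeps every row of df1; unmatched genes get NA, then 0 / "0"
df3 <- df1 %>%
  left_join(df2_long, by = "Genes") %>%
  mutate(
    across(where(is.numeric),   ~ replace_na(.x, 0)),
    across(where(is.character), ~ replace_na(.x, "0"))
  )
```

Numeric and character columns are filled separately so that CNA stays a character column; the missing entry becomes the string "0" rather than a number.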
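Answer 1 (score: 4)
This is a merge operation, but first you have to bring df2 into the right format, in which each gene gets its own row (instead of a single entry holding several comma-separated genes). Using unnest() from the tidyr package: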
df2 <- tidyr::unnest(
transform(df2, Genes = strsplit(as.character(df2$Genes), ",")),
Genes)
The result looks like this:
df2
# RL pValue. chr start end CNA Genes
#1 2 2.594433 1 129740006 129780779 gain Klri2
#2 2 3.941399 1 130080653 130380997 gain Serpin
#3 2 3.941399 1 130080653 130380997 gain St18
#4 2 3.941399 1 130080653 130380997 gain Myoc
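Now you can simply use merge(df1, df2, all.x = TRUE), or left_join from dplyr (or other packages such as data.table, depending on what you want to learn). Note that this introduces NA where you wanted zeros, but you can easily replace them.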
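For completeness, here is a minimal base-R sketch of the whole pipeline, including the NA replacement. It is only one possible way to do it: the data are retyped from the question, `pValue.` is shortened to `pValue`, the intermediate name `df2_long` is made up for illustration, and only RL, pValue and CNA are kept to match the desired df3.

```r
# Example data retyped from the question (column `pValue.` renamed to `pValue`)
df1 <- data.frame(
  Genes     = c("Klri2", "Rrs1", "Serpin", "Myoc", "St18"),
  sample.ID = "LO.WGS",
  chrom     = 1,
  loc.start = 3010000,
  loc.end   = 173490000,
  num.mark  = 8430,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  RL     = c(2, 2),
  pValue = c(2.594433, 3.941399),
  chr    = c(1, 1),
  start  = c(129740006, 130080653),
  end    = c(129780779, 130380997),
  CNA    = c("gain", "gain"),
  Genes  = c("Klri2", "Serpin,St18,Myoc"),
  stringsAsFactors = FALSE
)

# One row per gene: repeat each df2 row once for every gene it lists
genes <- strsplit(df2$Genes, ",", fixed = TRUE)
df2_long <- df2[rep(seq_len(nrow(df2)), lengths(genes)), c("RL", "pValue", "CNA")]
df2_long$Genes <- unlist(genes)

# Left join on Genes: genes absent from df2 get NA in RL, pValue and CNA
df3 <- merge(df1, df2_long, by = "Genes", all.x = TRUE, sort = FALSE)

# Replace every NA with 0 (it becomes "0" in the character column CNA)
df3[is.na(df3)] <- 0

# merge() does not guarantee df1's row order, so restore it explicitly
df3 <- df3[match(df1$Genes, df3$Genes), ]
rownames(df3) <- NULL
```

The final match() step is there because merge() may reorder rows; drop it if the order of genes does not matter.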