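Combine columns, update columns based on another df, fill NA

Date: 2017-09-19 06:42:21

Tags: r merge dplyr

To begin with, I want to note that I found multiple solutions on SO, but none of them met my expectations.

I have two DFs: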

1

E                           F              G        H
chr1_100203723_100203724    NA             NA       NA
chr1_100212951_100212952    rs760764323    A,G,     0.000008,0.999992,
chr1_10032235_10032236      NA             NA       NA
chr1_100327060_100327061    NA             NA       NA
chr1_100346889_100346890    NA             NA       NA
chr1_100347237_100347238    rs749372877    C,G,T,   0.000008,0.000008,0.999983,
chr1_100357190_100357191    NA             NA       NA
chr1_100358057_100358058    NA             NA       NA
chr2_182852606_182852607    NA             NA       NA
chr2_202492077_202492078    NA             NA       NA
chr2_203760838_203760839    NA             NA       NA
chr2_215976351_215976352    NA             NA       NA
chr2_220354644_220354645    NA             NA       NA
chr2_234749403_234749404    NA             NA       NA
chr2_11802110_11802111      NA             NA       NA
chr2_31167747_31167748      NA             NA       NA

2

E                           F               G       H
chr1_100203723_100203724    NA              NA      NA
chr1_100212951_100212952    NA              NA      NA
chr1_10032235_10032236      NA              NA      NA
chr1_100327060_100327061    NA              NA      NA
chr1_100346889_100346890    NA              NA      NA
chr1_100347237_100347238    NA              NA      NA
chr1_100357190_100357191    NA              NA      NA
chr1_100358057_100358058    NA              NA      NA
chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
chr2_203760838_203760839    NA              NA      NA
chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
chr2_220354644_220354645    NA              NA      NA
chr2_234749403_234749404    NA              NA      NA
chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,

Expected output:

E                           F               G       H
chr1_100203723_100203724    NA              NA      NA
chr1_100212951_100212952    rs760764323     A,G,    0.000008,0.999992,
chr1_10032235_10032236      NA              NA      NA
chr1_100327060_100327061    NA              NA      NA
chr1_100346889_100346890    NA              NA      NA
chr1_100347237_100347238    rs749372877     C,G,T,  0.000008,0.000008,0.999983,
chr1_100357190_100357191    NA              NA      NA
chr1_100358057_100358058    NA              NA      NA
chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
chr2_203760838_203760839    NA              NA      NA
chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
chr2_220354644_220354645    NA              NA      NA
chr2_234749403_234749404    NA              NA      NA
chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,

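As you can see, DF1 is updated with columns F, G, H from DF2, where column E is my unique index. I tried merge(), but that function did not update my rows; it added DF2's columns to DF1. I also tried updating with data.table and tidyverse: my rows were updated, but the other rows turned into NAs... In the end I decided to do a simple ifelse() inside a nested lapply(), but I don't know how to update all three columns at once, and for data with more than 50000 rows in each DF this is very slow...

What I've done so far: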

DF1$F <- sapply(1:nrow(DF1), function(i) ifelse(DF1[i,1]==DF2[i,1] & is.na(DF1[i,1]), DF2[i,1], DF[i,1]))

3 answers:

Answer 0 (score: 4)

You can do this in base R:

as.data.frame(Map(function(x,y) ifelse(is.na(x),y,x),DF1,DF2))

With the purrr library you can get a prettier, more compact form (see Sotos' answer for an even more compact one using dplyr):

library(purrr)
map2_df(DF1,DF2,~ifelse(is.na(.x),.y,.x))

In both cases (technically a data.frame in the first case and a tibble in the second):

Output

                            E           F      G                           H
1    chr1_100203723_100203724        <NA>   <NA>                        <NA>
2    chr1_100212951_100212952 rs760764323   A,G,          0.000008,0.999992,
3      chr1_10032235_10032236        <NA>   <NA>                        <NA>
4    chr1_100327060_100327061        <NA>   <NA>                        <NA>
5    chr1_100346889_100346890        <NA>   <NA>                        <NA>
6    chr1_100347237_100347238 rs749372877 C,G,T, 0.000008,0.000008,0.999983,
7    chr1_100357190_100357191        <NA>   <NA>                        <NA>
8    chr1_100358057_100358058        <NA>   <NA>                        <NA>
9    chr2_182852606_182852607 rs773426830   C,T,          0.999967,0.000033,
10   chr2_202492077_202492078 rs750583431   C,G,          0.000013,0.999987,
11   chr2_203760838_203760839        <NA>   <NA>                        <NA>
12   chr2_215976351_215976352 rs113648834   C,T,          0.999934,0.000066,
13   chr2_220354644_220354645        <NA>   <NA>                        <NA>
14   chr2_234749403_234749404        <NA>   <NA>                        <NA>
15     chr2_11802110_11802111 rs371327070   A,G,          0.000044,0.999956,
16     chr2_31167747_31167748 rs201375957 A,C,T, 0.000008,0.999887,0.000105,

Data

DF1 <- read.table(text="E                           F              G        H
chr1_100203723_100203724    NA             NA       NA
chr1_100212951_100212952    rs760764323    A,G,     0.000008,0.999992,
chr1_10032235_10032236      NA             NA       NA
chr1_100327060_100327061    NA             NA       NA
chr1_100346889_100346890    NA             NA       NA
chr1_100347237_100347238    rs749372877    C,G,T,   0.000008,0.000008,0.999983,
chr1_100357190_100357191    NA             NA       NA
chr1_100358057_100358058    NA             NA       NA
chr2_182852606_182852607    NA             NA       NA
chr2_202492077_202492078    NA             NA       NA
chr2_203760838_203760839    NA             NA       NA
chr2_215976351_215976352    NA             NA       NA
chr2_220354644_220354645    NA             NA       NA
chr2_234749403_234749404    NA             NA       NA
chr2_11802110_11802111      NA             NA       NA
chr2_31167747_31167748      NA             NA       NA",header=T,stringsAsFactors=F)


DF2 <- read.table(text="E                           F               G       H
chr1_100203723_100203724    NA              NA      NA
chr1_100212951_100212952    NA              NA      NA
chr1_10032235_10032236      NA              NA      NA
chr1_100327060_100327061    NA              NA      NA
chr1_100346889_100346890    NA              NA      NA
chr1_100347237_100347238    NA              NA      NA
chr1_100357190_100357191    NA              NA      NA
chr1_100358057_100358058    NA              NA      NA
chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
chr2_203760838_203760839    NA              NA      NA
chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
chr2_220354644_220354645    NA              NA      NA
chr2_234749403_234749404    NA              NA      NA
chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,",header=T,stringsAsFactors=F)
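Both snippets in this answer pair the rows of DF1 and DF2 by position, which works for the sample data because the two frames list the E keys in the same order. If the row order could differ, one option is to align on E first. The following is only a minimal base R sketch under that assumption; the toy keys and values (k1, rs1, and so on) and the helper names idx and value_cols are made up for illustration and are not part of the question's data.

```r
# Toy data: same keys in both frames, but in a different row order
DF1 <- data.frame(E = c("k1", "k2", "k3"),
                  F = c(NA, "rs1", NA),
                  G = c(NA, "A,G,", NA),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(E = c("k3", "k1", "k2"),
                  F = c("rs3", NA, "rsX"),
                  G = c("C,T,", NA, "T,T,"),
                  stringsAsFactors = FALSE)

# For each row of DF1, the row of DF2 that carries the same key
idx <- match(DF1$E, DF2$E)

# Fill only the NA cells of each value column from the matched DF2 row;
# cells that already hold a value in DF1 are kept as they are
value_cols <- setdiff(names(DF1), "E")
DF1[value_cols] <- Map(function(x, y) ifelse(is.na(x), y[idx], x),
                       DF1[value_cols], DF2[value_cols])

print(DF1)
```

Because match() and ifelse() are vectorized, this avoids the row-by-row loop from the question, which should matter with 50000+ rows.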

Answer 1 (score: 4)

The coalesce function from dplyr does just that. I'm sure we could use a purrr function to map over the 2 data frames, but here is one using base R's mapply:

DF1[-1] <- mapply(dplyr::coalesce, DF1[-1], DF2[-1])

which gives,

                            E           F      G                           H
1    chr1_100203723_100203724        <NA>   <NA>                        <NA>
2    chr1_100212951_100212952 rs760764323   A,G,          0.000008,0.999992,
3      chr1_10032235_10032236        <NA>   <NA>                        <NA>
4    chr1_100327060_100327061        <NA>   <NA>                        <NA>
5    chr1_100346889_100346890        <NA>   <NA>                        <NA>
6    chr1_100347237_100347238 rs749372877 C,G,T, 0.000008,0.000008,0.999983,
7    chr1_100357190_100357191        <NA>   <NA>                        <NA>
8    chr1_100358057_100358058        <NA>   <NA>                        <NA>
9    chr2_182852606_182852607 rs773426830   C,T,          0.999967,0.000033,
10   chr2_202492077_202492078 rs750583431   C,G,          0.000013,0.999987,
11   chr2_203760838_203760839        <NA>   <NA>                        <NA>
12   chr2_215976351_215976352 rs113648834   C,T,          0.999934,0.000066,
13   chr2_220354644_220354645        <NA>   <NA>                        <NA>
14   chr2_234749403_234749404        <NA>   <NA>                        <NA>
15     chr2_11802110_11802111 rs371327070   A,G,          0.000044,0.999956,
16     chr2_31167747_31167748 rs201375957 A,C,T, 0.000008,0.999887,0.000105,

Note: as @Moody_Mudskipper points out, there is also a purrr version that generates a new data frame without altering DF1.
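The exact purrr call from that note is not preserved in this text. As a hedged sketch of what such a version might look like, the following combines map2_df() (used in the first answer) with dplyr::coalesce(); the toy frames df_a and df_b are hypothetical, and this should be read as an assumption, not as the original author's code.

```r
library(dplyr)
library(purrr)

# Hypothetical toy frames with the same rows in the same order
df_a <- data.frame(E = c("k1", "k2", "k3"),
                   F = c("a", NA, NA),
                   stringsAsFactors = FALSE)
df_b <- data.frame(E = c("k1", "k2", "k3"),
                   F = c(NA, "b", NA),
                   stringsAsFactors = FALSE)

# coalesce() keeps the first non-NA value at each position
first_non_na <- coalesce(c("a", NA, NA), c(NA, "b", NA))

# Apply coalesce() column by column across both frames; the result is a
# new tibble, and df_a itself is not modified
res <- map2_df(df_a, df_b, coalesce)

print(res)
```

Like the mapply version, this pairs rows by position, so it assumes both frames are already in the same order.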

Answer 2 (score: 0)

Another naive approach is to use paste0

> df1 <- data.frame(E = c('A','B','C'), F=c('0.9,1',NA,NA), G=c(NA,'0.98,0.34',NA), H=c(NA,'0.98,0.34',NA), stringsAsFactors = F)
> df2 <- data.frame(E = c('A','B','C'), F=c(NA,'1,3',NA), G=c(NA,NA,'5,6,7'), H=c(NA,NA,NA), stringsAsFactors = F)
> df1[is.na(df1)] <- ''
> df2[is.na(df2)] <- ''
> mapply(paste, df1[-1], df2[-1])
     F        G            H           
[1,] "0.9,1 " " "          " "         
[2,] " 1,3"   "0.98,0.34 " "0.98,0.34 "
[3,] " "      " 5,6,7"     " "         

Updated to use mapply, as suggested by Sotos.
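The console output in this answer contains stray spaces because paste() inserts its default separator " " between the two values, and the result is a character matrix with blanks where the NAs were. A possible refinement, sketched here on a two-column variant of the same toy data, is to use paste0(), turn the empty strings back into NA, and write the result back into the data frame. The helper names v1, v2 and merged are illustrative only.

```r
df1 <- data.frame(E = c("A", "B", "C"),
                  F = c("0.9,1", NA, NA),
                  G = c(NA, "0.98,0.34", NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(E = c("A", "B", "C"),
                  F = c(NA, "1,3", NA),
                  G = c(NA, NA, "5,6,7"),
                  stringsAsFactors = FALSE)

# Work on the value columns only, with NA temporarily replaced by ""
v1 <- df1[-1]
v2 <- df2[-1]
v1[is.na(v1)] <- ""
v2[is.na(v2)] <- ""

# paste0() joins without a separator, so no stray spaces appear
merged <- mapply(paste0, v1, v2)

# Cells that were NA in both inputs are "" here; restore them to NA
merged[merged == ""] <- NA
df1[-1] <- merged

print(df1)
```

One caveat of this approach: if a cell is non-NA in both data frames, the two strings are concatenated instead of one taking precedence, so it only suits data where at most one side has a value per cell.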