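How to merge two data frames in R based on similar values

Time: 2017-01-04 20:32:03

Tags: r dataframe merge

I am fairly new to R and have a question about merging two data frames that contain similar, but not identical, numeric data in two domains (mz and rt). Here is an example that describes my problem: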

mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))


df1
   mz1 rt1    value1
1  100   1 44605.646
2  110   2 13924.598
3  120   3 35727.265
4  130   4 75175.652
5  140   5 25221.724
6  150   6 29080.653
7  160   7  3170.749
8  170   8 10184.708
9  180   9 48055.072
10 190  10 77644.865


df2
        mz2      rt2   value2
1  100.0243 1.043092 58099.49
2  110.0514 2.164753 76397.67
3  120.0258 2.838141 43901.05
4  130.0921 4.044322 34543.96
5  139.9577 5.023823 53086.10
6  150.0170 6.061794 13929.27
7  160.0884 6.828779 60905.61
8  170.0440 7.932000 66627.20
9  180.0872 9.116425 44587.62
10 189.9694 9.834091 51186.03

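I want to merge all rows from df1 and df2 whose difference is <= 0.1 in the rt domain and <= 0.05 in the mz domain. In addition, if two or more rows meet this criterion, the row with the smallest distance across both domains should be merged (this may need an additional calculation: distance = sqrt(mz^2 + rt^2), with mz and rt taken as the differences in each domain), and the leftover rows must look for a different merge partner if one exists. If there is no merge partner, the row should be kept and the missing values filled with NA.

What I have tried so far: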
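The matching rule described above can be sketched in base R roughly as follows. This is only an illustration under assumptions: the helper name `candidate_pairs` and the toy input vectors are made up for this example, and the tolerances are the ones stated in the question.

```r
# Hypothetical helper (not from any package): list every pair of rows with
# |mz1 - mz2| <= mz_tol and |rt1 - rt2| <= rt_tol, plus their distance.
candidate_pairs <- function(mz1, rt1, mz2, rt2, mz_tol = 0.05, rt_tol = 0.1) {
  mz_diff <- abs(outer(mz1, mz2, "-"))  # rows: first data set, columns: second
  rt_diff <- abs(outer(rt1, rt2, "-"))
  hits <- which(mz_diff <= mz_tol & rt_diff <= rt_tol, arr.ind = TRUE)
  data.frame(
    i    = hits[, "row"],               # row index in the first data set
    j    = hits[, "col"],               # row index in the second data set
    dist = sqrt(mz_diff[hits]^2 + rt_diff[hits]^2)
  )
}

# Toy input (made up): the first entry has two candidates, the second has none
pairs <- candidate_pairs(
  mz1 = c(100, 110),             rt1 = c(1, 2),
  mz2 = c(100.02, 110.20, 100.04), rt2 = c(1.05, 2.00, 0.95)
)
pairs
```

Rows that never appear in `i` (or `j`) have no partner within the tolerances and would be the ones to keep with NA in the other data frame's columns.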

merge.data.frame(df1, df2, by.x = c("mz1", "rt1"), by.y = c("mz2", "rt2") , all = T)

        mz1 rt1    value1      rt2   value2
1  100.0000   1 44605.646       NA       NA
2  100.0243  NA        NA 1.043092 58099.49
3  110.0000   2 13924.598       NA       NA
4  110.0514  NA        NA 2.164753 76397.67
5  120.0000   3 35727.265       NA       NA
6  120.0258  NA        NA 2.838141 43901.05
7  130.0000   4 75175.652       NA       NA
8  130.0921  NA        NA 4.044322 34543.96
9  139.9577  NA        NA 5.023823 53086.10
10 140.0000   5 25221.724       NA       NA
11 150.0000   6 29080.653       NA       NA
12 150.0170  NA        NA 6.061794 13929.27
13 160.0000   7  3170.749       NA       NA
14 160.0884  NA        NA 6.828779 60905.61
15 170.0000   8 10184.708       NA       NA
16 170.0440  NA        NA 7.932000 66627.20
17 180.0000   9 48055.072       NA       NA
18 180.0872  NA        NA 9.116425 44587.62
19 189.9694  NA        NA 9.834091 51186.03
20 190.0000  10 77644.865       NA       NA

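This at least gives me a data frame in the right format, with NA wherever no merge was possible.

It would be great if someone could help me with this problem!

Regards

Update

OK, I will keep that in mind. Thanks so far. I tried the following idea: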

#select data in joined which has no partner
no_match_df1 <- anti_join(joined, df2)
no_match_df1 <- no_match_df1[1:3]

#select data in df2 which has been excluded due to duplication
collist <- c("mz2", "rt2", "value2")
dublicates <- joined[complete.cases(joined[collist]), collist]
dublicates <- anti_join(df2, dublicates)


#repetition for joining
joined2 <- fuzzy_join(no_match_df1, dublicates, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                     multi_match_fun = mmf, mode = "full")

joined2 <- group_by(joined2, mz1, rt1) %>%
  mutate(min_dist = min(dist))
head(joined2)

joined2 <- filter(joined2, dist == min_dist | is.na(dist)) %>%
  select(-dist, -min_dist)
head(joined2)

#select only rows with a new match or where dublicates couldn't find a partner

add <- subset(joined2, !is.na(joined2$mz2) | !is.na(joined2$mz2) &  !is.na(joined2$mz1))

#add to joined
##I need some help here, how can I update the existing joined data frame?

Maybe this helps

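Maybe we could join no_match_df1 with duplicates as before and simply add the result by overwriting the corresponding rows in the existing joined data frame. Finally, we would have to repeat this procedure as long as the length of duplicates is > …

1 answer:

Answer 0 (score: 1)

Following joran's suggestion, I found a solution using the fuzzyjoin package. I created the data set as follows: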
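One possible alternative to repeating the join is a single greedy pass over all candidate pairs, sorted by distance. The sketch below is not the approach from the code above; the helper name `greedy_match` and the toy table are made up, and it assumes a table of candidates with columns `i` (row in df1), `j` (row in df2) and `dist`. Greedy assignment is simple but not guaranteed to minimise the total distance.

```r
# Hypothetical helper: accept pairs in order of increasing distance, skipping
# any pair whose df1 row or df2 row has already been assigned.
greedy_match <- function(pairs) {
  pairs <- pairs[order(pairs$dist), ]
  keep <- logical(nrow(pairs))
  used_i <- integer(0)
  used_j <- integer(0)
  for (k in seq_len(nrow(pairs))) {
    if (!(pairs$i[k] %in% used_i) && !(pairs$j[k] %in% used_j)) {
      keep[k] <- TRUE
      used_i <- c(used_i, pairs$i[k])
      used_j <- c(used_j, pairs$j[k])
    }
  }
  pairs[keep, ]
}

# Toy candidates (made up): both df1 rows prefer df2 row 1
pairs <- data.frame(
  i    = c(1, 1, 2, 2),
  j    = c(1, 2, 1, 2),
  dist = c(0.03, 0.05, 0.04, 0.08)
)
matched <- greedy_match(pairs)
matched
```

In this toy case df1 row 2 loses df2 row 1 to the closer df1 row 1 and falls back to df2 row 2, which is the "find a different merge partner" behaviour asked for.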

set.seed(123)
mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))

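(A small side remark: you provided a very nice reproducible example. Its only shortcoming is that you did not set a seed, which is the only difference in the code above.)

To make sure there is a case with two matches, I added a row to df2: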
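As a quick illustration of why the seed matters (a minimal sketch; the variable names are arbitrary): resetting the seed before drawing random numbers makes the draws repeat exactly, so everyone running the example sees the same data.

```r
set.seed(123)
a <- runif(3)

set.seed(123)   # same seed again
b <- runif(3)

identical(a, b) # the two draws are the same
```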

df2 <- rbind(df2, c(180.001, 9.09, 0))

Now I can use the function fuzzy_join() to merge the data frames:

library(fuzzyjoin)
joined <- fuzzy_join(df1, df2, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                     multi_match_fun = mmf, mode = "full")

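Note that the syntax is very similar to that of the join() functions in dplyr. There is one important difference, though: you can supply a function to multi_match_fun that determines whether two rows match. It returns a data frame whose first column must be logical. This column determines whether two rows match. All other columns are simply added to the resulting data frame. I defined this function as follows (it has to be defined before fuzzy_join() is called):

mmf <- function(x, y) {
  # absolute differences in the mz and rt domains
  mz_dist <- abs(x[, 1] - y[, 1])
  rt_dist <- abs(x[, 2] - y[, 2])
  # data_frame() comes from dplyr/tibble; first column is the match indicator
  out <- data_frame(merge = rt_dist <= 0.1 & mz_dist < 0.05,
                    dist = sqrt(mz_dist^2 + rt_dist^2))
  return (out)
}

You can see that the column merge (the name is arbitrary) is TRUE if the conditions you specified are met. In addition, a column containing the distance is added for later use. I set mode = "full" in order to get NA values where there is no match.

The result looks as follows:

head(joined)
##   mz1 rt1   value1      mz2      rt2   value2       dist
## 1 110   2 78851.68 109.9907 2.077121 90239.67 0.07768406
## 2 120   3 40956.79 120.0355 3.056203 69101.46 0.06648308
## 3 180   9 55188.36 179.9656 8.915664 31886.28 0.09108803
## 4 180   9 55188.36 180.0010 9.090000     0.00 0.09000556
## 5 100   1 28828.99       NA       NA       NA         NA
## 6 130   4 88313.44       NA       NA       NA         NA

In rows 3 and 4 you can see that there are indeed two matches in this case. The column dist shows that row 4 is the one we want to keep. This means that row 3 should be treated as if no match had been found, and its columns mz1, rt1 and value1 should be filled with NA. I did this by grouping the rows by mz1 and rt1 and then adding the minimum distance for each group:

library(dplyr)
joined <- group_by(joined, mz1, rt1) %>%
  mutate(min_dist = min(dist))
head(joined)
## Source: local data frame [6 x 8]
## Groups: mz1, rt1 [5]
##
##     mz1   rt1   value1      mz2      rt2   value2       dist   min_dist
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67 0.07768406 0.07768406
## 2   120     3 40956.79 120.0355 3.056203 69101.46 0.06648308 0.06648308
## 3   180     9 55188.36 179.9656 8.915664 31886.28 0.09108803 0.09000556
## 4   180     9 55188.36 180.0010 9.090000     0.00 0.09000556 0.09000556
## 5   100     1 28828.99       NA       NA       NA         NA         NA
## 6   130     4 88313.44       NA       NA       NA         NA         NA

The rows that are valid matches are those where dist and min_dist are equal. We must also take care not to lose the rows where dist is NA. This can be done as follows:

dbls <- which(joined$dist != joined$min_dist)
joined[dbls, c("mz1", "rt1", "value1")] <- NA
joined <- select(joined, -dist, -min_dist)
head(joined)
## Source: local data frame [6 x 6]
## Groups: mz1, rt1 [6]
##
##     mz1   rt1   value1      mz2      rt2   value2
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67
## 2   120     3 40956.79 120.0355 3.056203 69101.46
## 3    NA    NA       NA 179.9656 8.915664 31886.28
## 4   180     9 55188.36 180.0010 9.090000     0.00
## 5   100     1 28828.99       NA       NA       NA
## 6   130     4 88313.44       NA       NA       NA

Depending on what your data look like, it may also happen that in a double match it is not the values of mz1 and rt1 that coincide, but the other pair of values (mz2 and rt2). In that case you would have to repeat the steps above with the other grouping as well.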
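A second pass with the other grouping could look roughly like this in base R. It is a sketch on a made-up table (`joined_toy`) with the same column layout as the joined data before dist is dropped: two different df1 rows matched the same df2 row, so the rows are grouped by mz2 and rt2 and the df2 columns of the losing row are blanked out. Building the group key with paste() is an assumption that works here because the duplicated values are exact copies.

```r
# Made-up example: df1 rows (180, 9) and (180.03, 9.05) both matched the
# df2 row (180.01, 9.02); the third row has no match at all.
joined_toy <- data.frame(
  mz1 = c(180, 180.03, 100), rt1 = c(9, 9.05, 1), value1 = c(10, 20, 30),
  mz2 = c(180.01, 180.01, NA), rt2 = c(9.02, 9.02, NA), value2 = c(5, 5, NA),
  dist = c(0.0224, 0.0361, NA)
)

has_match <- !is.na(joined_toy$dist)
key <- paste(joined_toy$mz2, joined_toy$rt2)   # one label per (mz2, rt2) group

# Minimum distance within each (mz2, rt2) group; unmatched rows stay NA
min_dist <- rep(NA_real_, nrow(joined_toy))
min_dist[has_match] <- ave(joined_toy$dist[has_match], key[has_match], FUN = min)

# Rows that matched but are not the closest partner lose their df2 values
losers <- which(has_match & joined_toy$dist != min_dist)
joined_toy[losers, c("mz2", "rt2", "value2")] <- NA
joined_toy$dist <- NULL
joined_toy
```

Here the second row keeps its df1 values and gets NA in the df2 columns, mirroring what the first pass does for the df1 columns.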