Compare and merge two data frames

Time: 2017-02-16 12:35:24

Tags: r

I have the following two data frames in R:

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")

df1 
  X  Y  Z score
1 A  1  6     1
2 A 11 20     2
3 A 21 30     3
4 B 35 40     4
5 B 45 60     5

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")

df2
  X  Y  Z out
1 A  1  5  x1
2 A  6 20  x2
3 A 21 30  x3
4 A 50 60  x4 
5 B 20 30  x5
6 B 31 40  x6
7 B 50 60  x7
8 C 10 20  x8

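For each row in df1, I want to check:

  • whether the value in 'X' matches any 'X' value in df2
  • if so: whether the range from 'Y' to 'Z' overlaps the range from 'Y' to 'Z' in df2
  • if both are true: then I want to add 'out' to df1.

This is what the output should look like: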

output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x6", "x7"))
colnames(output) = c("X", "Y", "Z", "score", "out")

  X  Y  Z score    out
1 A  1  6     1 x1, x2
2 A 11 20     2     x2
3 A 21 30     3     x3
4 B 35 40     4     x6
5 B 45 60     5     x7

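The original df1 is kept, with the extra column 'out' added to it.

Row 1 of the output contains 'x1, x2' in column 'out'. Reason: the values in column 'X' match, and the range 1 to 6 overlaps with rows 1 and 2 of df2.

I asked this question before (Compare values from two dataframes and merge), and the foverlaps function was suggested. However, because the columns differ between df1 and df2 and df2 has extra rows, I could not get it to work.

3 answers:

Answer 0: (score: 2)

Here are two possible approaches: a) the newly implemented non-equi join feature, and b) foverlaps, since you specifically mentioned it.

a) Non-equi join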

dt2[dt1, on=.(X, Z>=Y, Y<=Z), 
      .(score, out=paste(out, collapse=",")), 
    by=.EACHI]

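where dt1 and dt2 are the data.tables corresponding to df1 and df2. Note that you have to restore the column names Z and Y in the result (because the column names come from dt2, but the values come from dt1).

For each row of dt1, the matching rows of dt2 are found based on the conditions given to the on argument, and .() is evaluated on each set of those matching rows (because of by=.EACHI).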
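For reference, a self-contained sketch of the non-equi join, including the column-name fix mentioned above. It assumes data.table 1.9.8 or later (where non-equi joins are available) and rebuilds the question's data directly as data.tables; the name res is just an illustrative choice.

```r
library(data.table)

# Same data as in the question, built directly as data.tables
dt1 <- data.table(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5))
dt2 <- data.table(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = paste0("x", 1:8))

# For each row of dt1, find the dt2 rows with the same X whose [Y, Z]
# range overlaps: dt2$Z >= dt1$Y and dt2$Y <= dt1$Z
res <- dt2[dt1, on = .(X, Z >= Y, Y <= Z),
           .(score, out = paste(out, collapse = ",")),
           by = .EACHI]

# The join columns are named after dt2 but hold dt1's values:
# column "Z" carries dt1$Y and column "Y" carries dt1$Z, so swap the names
setnames(res, c("Z", "Y"), c("Y", "Z"))
setcolorder(res, c("X", "Y", "Z", "score", "out"))
print(res)
```

With the question's data, res should contain one row per row of dt1, with the matching out values collapsed into a single string.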

b) foverlaps

setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]

Answer 1: (score: 1)

library(dplyr)

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), 
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = FALSE)
colnames(df1) = c("X", "Y", "Z", "score")

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), 
                 c(5, 20, 30, 60, 30, 40, 60, 20), 
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = FALSE)
colnames(df2) = c("X", "Y", "Z", "out")


df1 %>%
  left_join(df2, by="X") %>%         # join on main column
  rowwise() %>%                      # for each row
  mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>%   # get how many elements of those ranges overlap
  filter(counter > 0) %>%            # keep rows with overlap
  group_by(X, Y.x, Z.x, score) %>%   # for each combination of those columns
  summarise(out = paste(out, collapse=", ")) %>%                # combine out column
  ungroup() %>%
  rename(Y = Y.x,
         Z = Z.x)

# # A tibble: 5 × 5
#       X     Y     Z score    out
#    <chr> <dbl> <dbl> <dbl> <chr>
# 1     A     1     6     1 x1, x2
# 2     A    11    20     2     x2
# 3     A    21    30     3     x3
# 4     B    35    40     4     x6
# 5     B    45    60     5     x7

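The process above is based on the dplyr package and involves a join plus some grouping and filtering. If your initial datasets (df1, df2) are very large, the join will create an even bigger dataset, which will take some time to build.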
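If the row-by-row seq() comparison becomes a bottleneck on larger data, one possible variation is to test the overlap with two vectorised comparisons instead. This is only a sketch: it assumes the ranges are closed intervals with Y <= Z, and it treats two ranges as overlapping when each one starts no later than the other ends. Unlike the seq() version, it does not require integer endpoints. The name res is an illustrative choice.

```r
library(dplyr)

df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = paste0("x", 1:8),
                  stringsAsFactors = FALSE)

res <- df1 %>%
  left_join(df2, by = "X") %>%              # all df2 rows sharing the same X
  filter(Y.x <= Z.y, Z.x >= Y.y) %>%        # keep overlapping ranges only
  group_by(X, Y.x, Z.x, score) %>%          # one group per original df1 row
  summarise(out = paste(out, collapse = ", ")) %>%
  ungroup() %>%
  rename(Y = Y.x, Z = Z.x)
print(res)
```

The join still produces every same-X pair before filtering, so this removes the per-row work but not the size of the intermediate table.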

Also note that this process works with character rather than factor variables. If it tries to join factor variables that have different levels, the process may convert the factor variables to character.
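To avoid depending on how the join treats factors, one option is to convert factor columns to character before joining. The sketch below uses small hypothetical data frames (df1f, df2f) and a hypothetical helper factors_to_character; none of these names come from the code above.

```r
library(dplyr)

# Hypothetical inputs whose key column X is a factor with different levels
df1f <- data.frame(X = factor(c("A", "B")), score = c(1, 4))
df2f <- data.frame(X = factor(c("A", "B", "C")),
                   out = c("x1", "x6", "x8"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: turn every factor column of a data frame into character
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df
}

# Both key columns are plain character by the time they are joined
joined <- left_join(factors_to_character(df1f),
                    factors_to_character(df2f),
                    by = "X")
str(joined)
```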

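I suggest you run the chained commands step by step to see how it works, and to spot anything I may have missed that could lead to bugs in the code.

Answer 2: (score: 0)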

Here is another option using sqldf:

library(sqldf)
# Left join on X, keeping df2 rows whose [Y, Z] range overlaps the df1 range
xx <- sqldf('select t1.*, t2.out from df1 t1 left join df2 t2
             on t1.X = t2.X and t2.Y <= t1.Z and t2.Z >= t1.Y')
# Collapse the matching out values for each df1 row
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(v) paste(unique(v), collapse = ", "))