I have the following two data frames in R:
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")
df1
X Y Z score
1 A 1 6 1
2 A 11 20 2
3 A 21 30 3
4 B 35 40 4
5 B 45 60 5
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")
df2
X Y Z out
1 A 1 5 x1
2 A 6 20 x2
3 A 21 30 x3
4 A 50 60 x4
5 B 20 30 x5
6 B 31 40 x6
7 B 50 60 x7
8 C 10 20 x8
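For each row in df1, I want to check whether any row in df2 has the same X and a Y to Z range that overlaps the row's own Y to Z range, and if so collect the matching out values.
This is what the output should look like: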
output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x6", "x7"))
colnames(output) = c("X", "Y", "Z", "score", "out")
X Y Z score out
1 A 1 6 1 x1, x2
2 A 11 20 2 x2
3 A 21 30 3 x3
4 B 35 40 4 x6
5 B 45 60 5 x7
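The original df1 is kept, with an additional column 'out' added.
Row 1 of 'output' contains 'x1, x2' in column 'out'. Reason: the value in column 'X' matches, and the range 1 to 6 overlaps the ranges in rows 1 and 2 of df2.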
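The matching rule described above can be sketched in base R. This is only an illustration of the rule, assuming closed ranges (shared endpoints count as an overlap); the helper name overlapping_out is hypothetical and does not appear in the question.

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: 'out' values of the df2 rows that share X with row i
# of df1 and whose [Y, Z] range overlaps that row's [Y, Z] range.
# Two closed ranges overlap when each one starts no later than the other ends.
overlapping_out <- function(i) {
  hit <- df2$X == df1$X[i] & df2$Y <= df1$Z[i] & df2$Z >= df1$Y[i]
  paste(df2$out[hit], collapse = ", ")
}

overlapping_out(1)            # row 1 of df1 (A, 1 to 6) gives "x1, x2"
sapply(1:5, overlapping_out)  # one string per row of df1
```

Looping row by row like this is easy to read but slow on large inputs, which is why the answers below use joins instead.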
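I asked this question before (Compare values from two dataframes and merge) and was advised to use the foverlaps function. However, because the columns differ between df1 and df2 and df2 has extra rows, I could not get it to work.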
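Answer 0 (score: 2)
Here are two possible approaches: a) using the newly implemented non-equi joins feature, and b) foverlaps, since you specifically mentioned it.
a) non-equi joins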
dt2[dt1, on=.(X, Z>=Y, Y<=Z),
.(score, out=paste(out, collapse=",")),
by=.EACHI]
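where dt1 and dt2 are the data.tables corresponding to df1 and df2. Note that you will have to revert the column names Z and Y in the result (since the column names come from dt2, but the values come from dt1).
For each row of dt1, the matching rows of dt2 are found based on the condition provided to the on argument, and .() is evaluated on those matching rows for each dt1 row (because of by=.EACHI).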
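As a fuller sketch of approach a), the block below builds dt1 and dt2, runs the same join, and swaps the two column names back. It assumes the data.table package is installed and that the join columns come back as X, Z, Y as described above; the variable name res is only illustrative.

```r
library(data.table)

dt1 <- data.table(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5))
dt2 <- data.table(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))

# For each row of dt1, find the dt2 rows with the same X whose range overlaps
# (dt2$Z >= dt1$Y and dt2$Y <= dt1$Z), then collapse their 'out' values.
res <- dt2[dt1, on = .(X, Z >= Y, Y <= Z),
           .(score, out = paste(out, collapse = ",")),
           by = .EACHI]

# The join columns carry dt2's names (Z, Y) but hold dt1's Y and Z values,
# so swap the two names back.
setnames(res, c("Z", "Y"), c("Y", "Z"))
res
```

Renaming by name rather than by position keeps the fix valid even if other columns are added to the j expression later.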
b)foverlaps
setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]
Answer 1 (score: 1)
library(dplyr)
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45),
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = FALSE)
colnames(df1) = c("X", "Y", "Z", "score")
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10),
                 c(5, 20, 30, 60, 30, 40, 60, 20),
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = FALSE)
colnames(df2) = c("X", "Y", "Z", "out")
df1 %>%
left_join(df2, by="X") %>% # join on main column
rowwise() %>% # for each row
mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>% # get how many elements of those ranges overlap
filter(counter > 0) %>% # keep rows with overlap
group_by(X, Y.x, Z.x, score) %>% # for each combination of those columns
summarise(out = paste(out, collapse=", ")) %>% # combine out column
ungroup() %>%
rename(Y = Y.x,
Z = Z.x)
# # A tibble: 5 × 5
# X Y Z score out
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 A 1 6 1 x1, x2
# 2 A 11 20 2 x2
# 3 A 21 30 3 x3
# 4 B 35 40 4 x6
# 5 B 45 60 5 x7
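The process above is based on the dplyr package and involves a join plus some grouping and filtering. If your initial datasets (df1, df2) are very large, the join will create an even larger intermediate dataset, which will take some time to build.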
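To get a feel for how much the join grows, the sketch below counts rows before and after joining on X alone. It swaps in base R's merge() for dplyr's left_join so that it needs no packages; with all.x = TRUE it performs the same left join on this data.

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  stringsAsFactors = FALSE)

# Joining on X alone pairs every df1 row with every df2 row sharing its X.
joined <- merge(df1, df2, by = "X", all.x = TRUE)

nrow(df1)     # 5
nrow(joined)  # 18: 3 * 4 pairs for "A" plus 2 * 3 pairs for "B"
```

The intermediate size grows with the product of the per-group row counts, so a few large groups can dominate the cost.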
Also note that this process works with character rather than factor variables. If it tries to join factor variables that have different levels, the process may convert the factor variables to character.
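One way to sidestep the factor issue is to convert factor columns to character before joining. The sketch below uses a small hypothetical data frame (not the data from the question) and base R only.

```r
# Hypothetical input where X was created as a factor
df <- data.frame(X = factor(c("A", "A", "B")),
                 Y = c(1, 11, 35))

# Convert every factor column to character and leave other columns untouched
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

str(df)
```

Doing the conversion up front makes the column types explicit instead of relying on whatever coercion the join performs.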
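I suggest you run the chained commands step by step to see how it works and to spot whether I have missed anything that could lead to bugs in the code.
Answer 2 (score: 0)
Here is an approach using sqldf: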
library(sqldf)
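# Two closed ranges overlap when each one starts no later than the other ends;
# this also covers a df2 range that fully contains the df1 range.
xx = sqldf('select t1.*, t2.out from df1 t1 left join df2 t2 on t1.X = t2.X and t2.Y <= t1.Z and t2.Z >= t1.Y')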
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))