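I have been searching for a solution and experimenting for a while, but I cannot get what should be a simple task to work.
I have two data frames in a format similar to the following toy example: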
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), B=c("kittens","puppies","kittens",NA), C=c(88,99,101,110))
A B C
1 cats kittens 88
2 dogs puppies 99
3 NA kittens 101
4 dogs NA 110
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
D A B
1 1 cats kittens
2 2 dogs puppies
I want to merge the two datasets so that the output is:
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 dogs NA 110 2
4 NA kittens 101 1
In other words, any row with A == "cats" or B == "kittens" should be mapped to 1 in column D, and any row with A == "dogs" or B == "puppies" should be mapped to 2.
I used the command
merge(DF1, DF2, by=c("A","B"), all.x=TRUE)
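However, this matches only rows 1 and 2 correctly; rows 3 and 4, which each have an NA in one of the key columns, are left unmatched. The output I get is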
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 dogs NA 110 NA
4 NA kittens 101 NA
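Please note that the actual datasets I am working with are very long: DF1 has over 1,000,000 rows and DF2 has over 300,000 rows, so what I really need is a solution that scales.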
Answer 0 (score: 3)
Perhaps you can try something along these lines:
temp <- merge(DF1, DF2, by=c("A","B"), all.x=TRUE)
within(temp, {
M1 <- c("cats", "kittens")
D <- ifelse(A %in% M1 | B %in% M1, 1, 2)
rm(M1)
})
# A B C D
# 1 cats kittens 88 1
# 2 dogs puppies 99 2
# 3 dogs <NA> 110 2
# 4 <NA> kittens 101 1
If you need more than just these two options, you can nest ifelse statements.
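With many categories, nested ifelse calls get unwieldy. Below is a minimal sketch of a lookup-based alternative. It assumes each value of A and each value of B appears at most once in DF2, and the helper name lookup_d is made up for illustration. It looks up D through A first and falls back to B where A found nothing.

```r
DF1 <- data.frame(A = c("cats", "dogs", NA, "dogs"),
                  B = c("kittens", "puppies", "kittens", NA),
                  C = c(88, 99, 101, 110))
DF2 <- data.frame(D = c(1, 2), A = c("cats", "dogs"), B = c("kittens", "puppies"))

# Hypothetical helper (not from the answer): look up D via A, falling back to B.
# Assumes the values in key$A and key$B are unique.
lookup_d <- function(df, key) {
  by_a <- key$D[match(df$A, key$A)]  # NA where A is missing or not in key
  by_b <- key$D[match(df$B, key$B)]  # NA where B is missing or not in key
  ifelse(is.na(by_a), by_b, by_a)    # prefer the A match, else use the B match
}

DF1$D <- lookup_d(DF1, DF2)
DF1
```

Because match() is vectorized and keeps the row order of DF1, this should cope with large inputs better than hard-coded conditions, though that has not been measured here.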
Answer 1 (score: 2)
DF1[which(DF1$A=="cats"|DF1$B=="kittens"), "D"] <- DF2[which(DF2$A=="cats"|DF2$B=="kittens"), "D"]
DF1[which(DF1$A=="dogs"|DF1$B=="puppies"), "D"] <- DF2[which(DF2$A=="dogs"|DF2$B=="puppies"), "D"]
DF1
#-------
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 <NA> kittens 101 1
4 dogs <NA> 110 2
Wrapped in a function:
idxpick <- function(a,b) DF1[which(DF1$A==a|DF1$B==b), "D"] <<- # Yes, I feel guilty.
DF2[which(DF2$A==a|DF2$B==b), "D"]
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"),
B=c("kittens","puppies","kittens",NA),
C=c(88,99,101,110))
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
apply(DF2, 1, function(rr) idxpick(rr["A"], rr["B"]) )
#------------
[1] 1 2
DF1
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 <NA> kittens 101 1
4 dogs <NA> 110 2
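The global assignment with <<- can be avoided. The following is a sketch of the same row-by-row idea that returns a modified copy instead. The name fill_d is hypothetical, and the loop runs once per row of the key table, so it is unlikely to be fast for a key table with hundreds of thousands of rows.

```r
DF1 <- data.frame(A = c("cats", "dogs", NA, "dogs"),
                  B = c("kittens", "puppies", "kittens", NA),
                  C = c(88, 99, 101, 110))
DF2 <- data.frame(D = c(1, 2), A = c("cats", "dogs"), B = c("kittens", "puppies"))

# Hypothetical helper: returns a copy of `df` with column D filled in,
# leaving the caller's data frame untouched.
fill_d <- function(df, key) {
  df$D <- NA_real_
  for (i in seq_len(nrow(key))) {
    # which() drops the NA comparisons, so rows with a missing A or B
    # are still matched through the other column.
    hit <- which(df$A == key$A[i] | df$B == key$B[i])
    df$D[hit] <- key$D[i]
  }
  df
}

out <- fill_d(DF1, DF2)
out
```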
Answer 2 (score: 2)
Here is a different approach:
# Reduce() is part of base R, so no extra package needs to be loaded.
partial.merge <- function(DF1, DF2) {
common.cols <- intersect(names(DF1), names(DF2))
result.col <- names(DF2)[!(names(DF2) %in% common.cols)]
# This can only handle one result column:
stopifnot(length(result.col) == 1)
# Merge in each common column, one at a time.
# The identical operation is done for each common column, so Reduce is useful:
r <- Reduce(function(D, C) merge(D, DF2[c(C, result.col)], by=c(C), all.x=TRUE), x=common.cols, init=DF1)
# The merge created cols like c('D.x', 'D.y'). These are the columns:
merge.cols <- paste(result.col, c('x', 'y'), sep='.')
# The .x and .y columns are partial, put them together:
r[[result.col]] <- rowMeans(r[merge.cols], na.rm=TRUE)
# Remove the temporaries:
for (i in merge.cols) {
r[[i]] <- NULL
}
return(r)
}
partial.merge(DF1, DF2)
## B A C D
## 1 kittens cats 88 1
## 2 kittens <NA> 101 1
## 3 puppies dogs 99 2
## 4 <NA> dogs 110 2
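One thing to watch with rowMeans: if a row's A and B point to different D values, the two are silently averaged (1 and 2 would become 1.5). Below is a small sketch for flagging such rows before merging. The third row of this DF1 is invented for illustration and is not part of the question's data.

```r
# Toy data; the third row (cats / puppies) is an invented conflicting case.
DF1 <- data.frame(A = c("cats", "dogs", "cats"),
                  B = c("kittens", NA, "puppies"),
                  C = c(88, 110, 120))
DF2 <- data.frame(D = c(1, 2), A = c("cats", "dogs"), B = c("kittens", "puppies"))

by_a <- DF2$D[match(DF1$A, DF2$A)]  # D implied by column A
by_b <- DF2$D[match(DF1$B, DF2$B)]  # D implied by column B

# A row conflicts when both lookups succeed but disagree.
conflict <- !is.na(by_a) & !is.na(by_b) & by_a != by_b
DF1[conflict, ]
```

Rows flagged this way could be inspected or dropped before calling partial.merge, depending on which column should take precedence.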