I am trying to merge two datasets on one key, but if there is no match on that key, I want to try another key, and so on.
df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F"))
df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o"))
df1
a b c
1 5 T F
2 1 T T
3 7 T F
4 3 F F
df2
x1 x2 x3 ..
1 4 7 g ..
2 5 8 w ..
3 3 1 t ..
4 9 2 o ..
The desired output looks like this:
a b c x3 ..
1 5 T F w ..
2 1 T T t ..
3 7 T F g ..
4 3 F F t ..
I tried something like this:
dfm <- merge(df1,df2, by.x = "a", by.y = "x1", all.x = TRUE)
dfm <- merge(dfm,df2, by.x = "a", by.y = "x2", all.x = TRUE)
but that is not quite right.
Answer 0 (score: 3)
This isn't really a standard kind of merge. You can make it more standard by reshaping df2 so that you only merge on a single field:
df2long <- rbind(
data.frame(a = df2$x1, df2[,-(1:2), drop=FALSE]),
data.frame(a = df2$x2, df2[,-(1:2), drop=FALSE])
)
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
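To check this against the sample data, here is a minimal, self-contained sketch in base R. The helper column `ord` is a hypothetical name added for illustration: merge() sorts its result by the key, so `ord` is used only to restore the original row order of df1.

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Stack the two key columns into a single key column named "a",
# repeating the remaining column(s) of df2 once per key column.
df2long <- rbind(
  data.frame(a = df2$x1, df2[, -(1:2), drop = FALSE]),
  data.frame(a = df2$x2, df2[, -(1:2), drop = FALSE])
)

# Remember the original row order of df1 (assumption: we want to keep it).
df1$ord <- seq_len(nrow(df1))

dfm <- merge(df1, df2long, by = "a", all.x = TRUE)

# Restore the original order and drop the helper column.
dfm <- dfm[order(dfm$ord), ]
dfm$ord <- NULL
rownames(dfm) <- NULL
dfm
```

One thing to keep in mind with this approach: if a value of a occurs in both x1 and x2 (or more than once in either), merge() returns one row per match rather than only the first one.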
Answer 1 (score: 3)
You can do something like this:
matches <- lapply(df2[, c("x1", "x2")], function(x) match(df1$a, x))
# finding matches in df2$x1 and df2$x2
# notice that the code below should work with any number of columns to be matched:
# you just need to add the names here eg. df2[, paste0("x", 1:100)]
matches
$x1
[1]  2 NA NA  3

$x2
[1] NA  3  1 NA
combo <- Reduce(function(a,b) "[<-"(a, is.na(a), b[is.na(a)]), matches)
# combining the matches on "first come first served" basis
combo
[1] 2 3 1 3
cbind(df1, df2[combo,])
    a b c x1 x2 x3
2   5 T F  5  8  w
3   1 T T  3  1  t
1   7 T F  4  7  g
3.1 3 F F  3  1  t
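The Reduce() call above is compact but fairly dense. The following is a more explicit sketch of the same "first come, first served" idea. The loop and the name `key_cols` are illustrative additions, not part of the original answer. It also attaches only x3, so the result has plain row names.

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Candidate key columns of df2, in order of precedence (illustrative name).
key_cols <- c("x1", "x2")

# combo[i] holds the row of df2 matching df1$a[i], or NA if no column matches.
combo <- rep(NA_integer_, nrow(df1))
for (col in key_cols) {
  unmatched <- is.na(combo)
  # Only fill positions that no earlier key column has already matched.
  combo[unmatched] <- match(df1$a[unmatched], df2[[col]])
}

result <- cbind(df1, x3 = df2$x3[combo])
result
```

Because each pass only touches positions that are still NA, a match in an earlier column is never overwritten by a later one.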
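Answer 2 (score: 2)
If I understand correctly, the OP wants to first try matching a with x1 and then, if that fails, try matching a with x2. So any match of a with x1 should take precedence over a match of a with x2.
Unfortunately, the sample dataset provided by the OP does not include a case that demonstrates this. I have therefore modified the sample dataset accordingly (see the Data section).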
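The approach suggested here is to reshape df2 from wide to long format (like MrFlick's answer), but to use a data.table join with the parameter mult = "first".
The columns of df2 to be treated as key columns, and their precedence, can be controlled through the measure.vars parameter of melt(). After reshaping, melt() arranges the rows in the order of the columns given in measure.vars:
library(data.table)
# define cols of df2 to use as key in order of precedence
key_cols <- c("x1", "x2")
# reshape df2 from wide to long format
long <- melt(setDT(df2), measure.vars = key_cols, value.name = "a")
# join long with df1, pick first matches
result <- long[setDT(df1), on = "a", mult = "first"]
# clean up
setcolorder(result, names(df1))
result[, variable := NULL]
result
   a b c   x3
1: 5 T F    w
2: 1 T T    t
3: 7 T F    g
4: 3 F F    t
5: 0 F F <NA>
Note that the original row order of df1 has been preserved.
Also note that the code works for an arbitrary number of key columns, and that the precedence of the key columns can easily be changed. For example, if the order is reversed, i.e. key_cols <- c("x2", "x1"), matches of a with x2 will be picked first.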
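Data (enhanced sample datasets):
df1 has an additional row that has no match in df2.
df1 <- data.frame(a=c(5,1,7,3,0),
b=c("T","T","T","F","F"),
c=c("F","T","F","F","F"))
df1
   a b c
1: 5 T F
2: 1 T T
3: 7 T F
4: 3 F F
5: 0 F F
df2 has an additional row to demonstrate that a match in x1 takes precedence over a match in x2. The value 5 appears twice: in row 2 of column x1 and in row 5 of column x2.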
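The precedence rule can also be illustrated without data.table. The sketch below swaps in base R's match(), which returns the position of the first hit, in place of the data.table join. The function name `lookup_first` is hypothetical, and so is the fifth row of df2 used here (x1 = NA, x2 = 5, x3 = "n"), which was chosen only to satisfy the description above.

```r
df1 <- data.frame(a = c(5, 1, 7, 3, 0),
                  b = c("T", "T", "T", "F", "F"),
                  c = c("F", "T", "F", "F", "F"))
# Assumption: the fifth row of df2 is a hypothetical stand-in.
df2 <- data.frame(x1 = c(4, 5, 3, 9, NA),
                  x2 = c(7, 8, 1, 2, 5),
                  x3 = c("g", "w", "t", "o", "n"))

# Hypothetical helper: look up value_col in df2 for each df1$a,
# trying the columns in key_cols in the given order of precedence.
lookup_first <- function(df1, df2, key_cols, value_col = "x3") {
  # Stack the key columns in order of precedence (wide to long).
  keys <- unlist(df2[key_cols], use.names = FALSE)
  values <- rep(df2[[value_col]], times = length(key_cols))
  # match() returns the first position found, so earlier key columns win.
  out <- df1
  out[[value_col]] <- values[match(df1$a, keys)]
  out
}

res_x1_first <- lookup_first(df1, df2, c("x1", "x2"))
res_x2_first <- lookup_first(df1, df2, c("x2", "x1"))
res_x1_first
res_x2_first
```

With x1 first, the value 5 picks up x3 from row 2 of df2; with x2 first, it picks up x3 from row 5. The row with a = 0 gets NA in both cases.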
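Answer 3 (score: 0)
I'm not sure I understood your question, but rather than merging repeatedly, you could count the values that each candidate key column shares with df1$a (if that count is > 0, you have a match). If you want to merge on the first column that has a match, you can try the following: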
library(tidyr)
library(purrr)
(df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F")) )
(df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o")) )
FirstColMatch<-1:ncol(df2) %>%
map(~intersect(df1$a, df2[[.x]])) %>%
map(length) %>%
detect_index(function(x)x>0)
NewDF<-merge(df1,df2,by.x="a", by.y =names(df2)[FirstColMatch])