I have a problem I can't find a solution to. Here is some sample data:
df<-data.frame(ID1=c("A10","B73","B73","D20"),
ID2=c(NA,"B4","C05","D100"),
ID3=c(NA,"B20","C30","D41"),
ID4=c(NA,NA,"B40","D0"),
ID5=c(NA,NA,NA,"D10"),
Score=c(15,376,102,30))
>df
ID1 ID2 ID3 ID4 ID5 Score
1 A10 <NA> <NA> <NA> <NA> 15
2 B73 B4 B20 <NA> <NA> 376
3 B73 C05 C30 B40 <NA> 102
4 D20 D100 D41 D0 D10 30
I also have a second data set with different ID numbers. Some of them match IDs in df, and each one comes with its own score. It looks like this:
df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
Score_Match=c(30,55,200,120,113,23,98))
>df_match
ID_Match Score_Match
1 A10 30
2 B4 55
3 B20 200
4 E20 120
5 A355 113
6 D0 23
7 C30 98
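What I want is for R to search df for ID matches and, where there is a match, put the matching ID and its Score_Match in new columns. If a row contains several matching IDs, the match from the rightmost column should be chosen. So the result would look like this: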
> df_Final
ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA> 15 A10 30
2 B73 B4 B20 <NA> <NA> 376 B20 200
3 B73 C05 C30 B40 <NA> 102 C30 98
4 D20 D100 D41 D0 D10 30 D0 23
I found answers to similar questions, such as
IDColumns <- 1:5
d <- df[,IDColumns] == "ID"
or
df$Check <- (rowSums(df[,startsWith(names(df),"ID")]=="ID") >= 1)
but most of the answers I found only work when searching for one specific string. Can anyone help me?
Answer 0 (score: 1)
First, a match matrix will be useful.
MX <- t(apply(df[, -6], 1, function(x) x %in% df_match$ID_Match))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE TRUE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE
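Now we need the rightmost matching column in each row: sum() counts the matches per row, and tail(which(x), 1) picks the last one.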
idx <- apply(MX, 1, function(x) {
if (sum(x) > 1)
tail(which(x == TRUE), 1)
else if (sum(x) == 1)
which(x == TRUE)
else NA
})
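As a possible shortcut for this step, base R's max.col() with ties.method = "last" can return the rightmost TRUE per row directly. The following is only a sketch under the assumption that the five ID columns come first; the names ids and idx2 are made up for this illustration and are not part of the answer.

```r
# Same sample data as in the question
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20"),
                 ID2 = c(NA, "B4", "C05", "D100"),
                 ID3 = c(NA, "B20", "C30", "D41"),
                 ID4 = c(NA, NA, "B40", "D0"),
                 ID5 = c(NA, NA, NA, "D10"),
                 Score = c(15, 376, 102, 30),
                 stringsAsFactors = FALSE)
ids <- c("A10", "B4", "B20", "E20", "A355", "D0", "C30")  # hypothetical name

# Logical match matrix over the five ID columns (one row per row of df)
MX <- t(apply(df[, 1:5], 1, function(x) x %in% ids))

# Rightmost TRUE per row: convert to 0/1 and take the last maximum
idx2 <- max.col(MX * 1, ties.method = "last")
# A row without any TRUE would otherwise report the last column, so mark it NA
idx2[rowSums(MX) == 0] <- NA
idx2
```

For the sample data this should give the same column positions as the apply() version above, since every row has at least one match.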
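Finally, look up each row's selected ID in df_match and cbind() the corresponding values. match() returns positions in the order of the looked-up IDs, so the result stays aligned with the rows of df.
hit <- sapply(seq_len(nrow(df)), function(x) as.character(df[x, idx[x]]))
res <- cbind(df,
df_match[match(hit, df_match$ID_Match), ])
Result
> res
ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA> 15 A10 30
3 B73 B4 B20 <NA> <NA> 376 B20 200
7 B73 C05 C30 B40 <NA> 102 C30 98
6 D20 D100 D41 D0 D10 30 D0 23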
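The lookup can also be written so that rows without any match simply get NA. The sketch below uses a smaller, made-up data set (the "Z1"/"Z2" row is an assumption added to show the no-match case), and the names id_cols, M, pos, last and hit are illustrative only.

```r
# Small made-up example; the third row has no match at all
df <- data.frame(ID1 = c("A10", "B73", "Z1"),
                 ID2 = c(NA, "B20", "Z2"),
                 Score = c(15, 376, 7),
                 stringsAsFactors = FALSE)
df_match <- data.frame(ID_Match = c("B20", "E20", "A10"),
                       Score_Match = c(200, 120, 30),
                       stringsAsFactors = FALSE)

id_cols <- grep("^ID", names(df), value = TRUE)
M <- as.matrix(df[id_cols])           # character matrix of IDs

# Position of every ID in df_match (NA if absent), reshaped like M
pos <- match(M, df_match$ID_Match)
dim(pos) <- dim(M)

# Column of the rightmost matching ID in each row, NA if the row has none
last <- apply(pos, 1, function(p) {
  w <- which(!is.na(p))
  if (length(w)) max(w) else NA_integer_
})

# Matrix indexing picks one cell per row; an NA column index yields NA
hit <- pos[cbind(seq_len(nrow(M)), last)]

res <- df
res$ID_Match <- df_match$ID_Match[hit]
res$Score_Match <- df_match$Score_Match[hit]
res
```

Because every step is indexed by row of df, the new columns cannot get out of step with the original rows, whatever the order of df_match.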
Answer 1 (score: 0)
Not sure whether this works in every case, but maybe it still helps.
df<-data.frame(ID1=c("A10","B73","B73","D20"),
ID2=c(NA,"B4","C05","D100"),
ID3=c(NA,"B20","C30","D41"),
ID4=c(NA,NA,"B40","D0"),
ID5=c(NA,NA,NA,"D10"),
Score=c(15,376,102,30))
df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
Score_Match=c(30,55,200,120,113,23,98))
library(dplyr) # provides %>%, arrange(), filter() and group_by()
# create backup for the results
df2 = df
# create a dummy-column as an "ID" for each row
df$rownumber = 1:NROW(df)
# convert the ID columns to long format and get rid of all those IDs that are NA
df = reshape2::melt(df, measure.vars = grep("^ID", names(df), value = TRUE), id.vars = "rownumber", na.rm = TRUE)
df %>% arrange(rownumber)
# find the matching scores for all IDs left
df = merge(df, df_match, by.x = "value", by.y = "ID_Match", all.x = T)
# remove all ids, that didn't have a match in df_match
df = df %>% filter(!is.na(Score_Match))
# remove the substring ID from each ID-Variable, so we can use it as a numeric
df$variable = as.numeric(as.character(gsub("ID", "", df$variable)))
# now lets select the ID most far right. Its the one with the highest ID<Number>
df = df %>% group_by(rownumber) %>% filter(variable == max(variable)) %>% arrange(rownumber)
# attach the data to the original file
df2$ID_Match = df$value
df2$Score_Match = df$Score_Match
df2
# > df2
# ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
# 1 A10 <NA> <NA> <NA> <NA> 15 A10 30
# 2 B73 B4 B20 <NA> <NA> 376 B20 200
# 3 B73 C05 C30 B40 <NA> 102 C30 98
# 4 D20 D100 D41 D0 D10 30 D0 23
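This may cause trouble if some rows have no match in any of the ID columns. In that case it might help to add df2$rownumber = 1:NROW(df2) and join df to df2 by row number rather than attaching the columns directly (I hope :))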
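One way that row-number join could look is sketched below in base R with match(). The data frame best is a hypothetical stand-in for the filtered long table from the steps above, made up here so that row 2 has no match; it is not output of the original code.

```r
# Backup table with a row-number key, as suggested above
df2 <- data.frame(ID1 = c("A10", "Z9", "B73"),
                  Score = c(15, 8, 102),
                  stringsAsFactors = FALSE)
df2$rownumber <- seq_len(nrow(df2))

# Hypothetical result of the filtering steps: no entry for row 2
best <- data.frame(rownumber = c(1, 3),
                   value = c("A10", "C30"),
                   Score_Match = c(30, 98),
                   stringsAsFactors = FALSE)

# Align by row number instead of relying on equal length and order
pos <- match(df2$rownumber, best$rownumber)
df2$ID_Match <- best$value[pos]
df2$Score_Match <- best$Score_Match[pos]
df2$rownumber <- NULL   # drop the helper key again
df2
```

Rows that are missing from best end up with NA in both new columns instead of shifting the later rows up.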