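R: find string matches across multiple columns, then select the rightmost matching column

Time: 2019-03-01 09:21:20

Tags: r string dataframe mutate

I have a problem I can't find a solution to. Here is some sample data: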

df<-data.frame(ID1=c("A10","B73","B73","D20"),
               ID2=c(NA,"B4","C05","D100"),
               ID3=c(NA,"B20","C30","D41"),
               ID4=c(NA,NA,"B40","D0"),
               ID5=c(NA,NA,NA,"D10"),
               Score=c(15,376,102,30))
>df
  ID1  ID2  ID3  ID4  ID5 Score
1 A10 <NA> <NA> <NA> <NA>    15
2 B73   B4  B20 <NA> <NA>   376
3 B73  C05  C30  B40 <NA>   102
4 D20 D100  D41   D0  D10    30

I also have data with different ID numbers, some of which match IDs in df, each paired with a Score. It looks like this:

df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
               Score_Match=c(30,55,200,120,113,23,98))
>df_match
  ID_Match Score_Match
1      A10          30
2       B4          55
3      B20         200
4      E20         120
5     A355         113
6       D0          23
7      C30          98

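What I want is for R to search df for ID matches and, where there is a match, put the matching ID and its Score in new columns. If a row contains several matching IDs, pick the one in the rightmost column. So it would look like this: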

> df_Final
  ID1  ID2  ID3  ID4  ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA>    15      A10          30
2 B73   B4  B20 <NA> <NA>   376      B20         200
3 B73  C05  C30  B40 <NA>   102      C30          98
4 D20 D100  D41   D0  D10    30       D0          23

I have found similar answers, such as:

IDColumns <- 1:5
d <- df[,IDColumns] == "ID"

df$Check <- (rowSums(df[,startsWith(names(df),"ID")]=="ID") >= 1)

But most of the answers I found only handle searching for one specific string. Can anyone help me?

2 Answers:

Answer 0: (score: 1)

First, a match matrix is useful.

MX <- t(apply(df[, -6], 1, function(x) x %in% df_match$ID_Match))

#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,]  TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE  TRUE FALSE FALSE
# [3,] FALSE FALSE  TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE  TRUE FALSE
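
As a variation, the same matrix can be built column by column, selecting the ID columns by name instead of dropping column 6 by position. This is only a sketch: `id_cols` is a name introduced here for illustration, and `stringsAsFactors = FALSE` is assumed so that the IDs are plain character vectors.

```r
# Sample data from the question (stringsAsFactors = FALSE is an assumption)
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20"),
                 ID2 = c(NA, "B4", "C05", "D100"),
                 ID3 = c(NA, "B20", "C30", "D41"),
                 ID4 = c(NA, NA, "B40", "D0"),
                 ID5 = c(NA, NA, NA, "D10"),
                 Score = c(15, 376, 102, 30),
                 stringsAsFactors = FALSE)
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98),
                       stringsAsFactors = FALSE)

# Hypothetical helper name: the columns called ID1, ID2, ...
id_cols <- grep("^ID[0-9]+$", names(df), value = TRUE)

# One logical column per ID column; NA entries never match, so they become FALSE
MX <- sapply(df[id_cols], function(col) col %in% df_match$ID_Match)
print(MX)
```

Note that sapply() returns a plain vector rather than a matrix when df has only one row, so this form assumes at least two rows.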

Now we need the rightmost matching column in each row; sum() tells us how many matches the row has.

idx <- apply(MX, 1, function(x) {
  if (sum(x) > 1)
    tail(which(x == TRUE), 1)
  else if (sum(x) == 1)
    which(x == TRUE)
  else NA
})
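
A possible shortcut for this step is base R's max.col() with ties.method = "last", which returns the last column holding the row maximum. The sketch below hard-codes the match matrix shown earlier and adds one hypothetical all-FALSE row to show the no-match case, which has to be masked by hand.

```r
# Match matrix from above, plus a hypothetical fifth row with no match at all
MX <- rbind(
  c(TRUE,  FALSE, FALSE, FALSE, FALSE),
  c(FALSE, TRUE,  TRUE,  FALSE, FALSE),
  c(FALSE, FALSE, TRUE,  FALSE, FALSE),
  c(FALSE, FALSE, FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Convert TRUE/FALSE to 1/0; with ties broken by "last", the result is the
# rightmost column containing a 1
idx <- max.col(MX * 1, ties.method = "last")

# A row of all zeros would otherwise report the last column, so set it to NA
idx[rowSums(MX) == 0] <- NA
idx
# 1 3 3 4 NA
```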

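Finally, look up each row's rightmost matching ID and cbind() the corresponding rows of df_match. Using match() rather than %in% keeps the rows of df_match aligned with the rows of df.

# rightmost matching ID for every row of df
ids <- sapply(seq_len(nrow(df)), function(x) as.character(df[x, idx[x]]))
# match() returns one df_match row per df row, in df's order
res <- cbind(df, df_match[match(ids, df_match$ID_Match), ])
rownames(res) <- NULL

Result

> res
  ID1  ID2  ID3  ID4  ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA>    15      A10          30
2 B73   B4  B20 <NA> <NA>   376      B20         200
3 B73  C05  C30  B40 <NA>   102      C30          98
4 D20 D100  D41   D0  D10    30       D0          23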
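
The per-row lookup can also be written without a loop by indexing a character matrix with (row, column) pairs. The following is a sketch under assumptions: `idx` is hard-coded to the rightmost matching column per row, a hypothetical fifth row with no match is added, and `stringsAsFactors = FALSE` is set explicitly. A row of the index matrix that contains NA yields NA, so unmatched rows simply end up with NA in the new columns.

```r
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20", "Z1"),
                 ID2 = c(NA, "B4", "C05", "D100", NA),
                 ID3 = c(NA, "B20", "C30", "D41", NA),
                 ID4 = c(NA, NA, "B40", "D0", NA),
                 ID5 = c(NA, NA, NA, "D10", NA),
                 Score = c(15, 376, 102, 30, 7),
                 stringsAsFactors = FALSE)
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98),
                       stringsAsFactors = FALSE)

id_cols <- grep("^ID[0-9]+$", names(df), value = TRUE)

# Assumed input: rightmost matching column per row, NA where nothing matched
idx <- c(1L, 3L, 3L, 4L, NA)

# Matrix indexing with (row, column) pairs; NA pairs give NA
ids <- as.matrix(df[id_cols])[cbind(seq_len(nrow(df)), idx)]

# match() keeps df's row order; an NA id selects an all-NA row of df_match
res <- cbind(df, df_match[match(ids, df_match$ID_Match), ])
rownames(res) <- NULL
res
```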

Answer 1: (score: 0)

I'm not sure this works in every case, but maybe it still helps.

df<-data.frame(ID1=c("A10","B73","B73","D20"),
               ID2=c(NA,"B4","C05","D100"),
               ID3=c(NA,"B20","C30","D41"),
               ID4=c(NA,NA,"B40","D0"),
               ID5=c(NA,NA,NA,"D10"),
               Score=c(15,376,102,30))


df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
                     Score_Match=c(30,55,200,120,113,23,98))

library(dplyr)  # needed for %>%, arrange(), filter() and group_by()

# create backup for the results
df2 = df

# create a dummy-column as an "ID" for each row
df$rownumber = 1:NROW(df)

# convert the data to long format and drop all IDs that are NA
df = reshape2::melt(df, measure.vars = names(df)[which(names(df) != "rownumber")], id.vars = "rownumber", na.rm = TRUE)
df %>% arrange(rownumber)

# find the matching scores for all IDs left
df = merge(df, df_match, by.x = "value", by.y = "ID_Match", all.x = T)
# remove all ids, that didn't have a match in df_match
df = df %>% filter(!is.na(Score_Match))
# remove the substring ID from each ID-Variable, so we can use it as a numeric
df$variable = as.numeric(as.character(gsub("ID", "", df$variable)))

# now select the rightmost ID: it is the one with the highest ID<Number>
df = df %>% group_by(rownumber) %>% filter(variable == max(variable)) %>% arrange(rownumber)

# attach the data to the original file
df2$ID_Match = df$value
df2$score_Match = df$Score_Match
df2

# > df2
#   ID1  ID2  ID3  ID4  ID5 Score ID_Match score_Match
# 1 A10 <NA> <NA> <NA> <NA>    15      A10          30
# 2 B73   B4  B20 <NA> <NA>   376      B20         200
# 3 B73  C05  C30  B40 <NA>   102      C30          98
# 4 D20 D100  D41   D0  D10    30       D0          23

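If some rows have no match in any ID column, this may cause trouble. In that case it might help to add df2$rownumber = 1:NROW(df2) and merge df with df2 by row number instead of attaching the columns directly (I hope :)).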
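
The merge-by-row-number idea could look roughly like the following base R sketch. It is not the answer's own code: it avoids reshape2 and dplyr, adds a hypothetical fifth row with no match to exercise the problem case, and introduces the names `id_cols`, `long`, `position` and `best` purely for illustration.

```r
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20", "Z1"),
                 ID2 = c(NA, "B4", "C05", "D100", NA),
                 ID3 = c(NA, "B20", "C30", "D41", NA),
                 ID4 = c(NA, NA, "B40", "D0", NA),
                 ID5 = c(NA, NA, NA, "D10", NA),
                 Score = c(15, 376, 102, 30, 7),
                 stringsAsFactors = FALSE)
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98),
                       stringsAsFactors = FALSE)

id_cols <- grep("^ID[0-9]+$", names(df), value = TRUE)

# Long format: one row per (original row, column position, ID value).
# unlist() walks the columns one after another, hence times= / each=.
long <- data.frame(
  rownumber = rep(seq_len(nrow(df)), times = length(id_cols)),
  position  = rep(seq_along(id_cols), each = nrow(df)),
  ID_Match  = unlist(df[id_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$ID_Match), ]

# Inner join: keep only IDs that occur in df_match, together with their score
long <- merge(long, df_match, by = "ID_Match")

# Per original row, keep the match with the largest column position
long <- long[order(long$rownumber, -long$position), ]
best <- long[!duplicated(long$rownumber), c("rownumber", "ID_Match", "Score_Match")]

# Left join by row number, so rows without any match get NA instead of shifting
df$rownumber <- seq_len(nrow(df))
df_final <- merge(df, best, by = "rownumber", all.x = TRUE)
df_final <- df_final[order(df_final$rownumber), ]
df_final$rownumber <- NULL
rownames(df_final) <- NULL
df_final
```

Because the result is joined on rownumber rather than attached positionally, a row without any match keeps its place and simply receives NA in ID_Match and Score_Match.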