Partial string matching between data frames with match() or similar, preserving the match positions

Time: 2016-04-12 17:12:28

Tags: r string-matching

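Using the function match(), I want to perform partial string matching between two character vectors from different data frames. The positions of the matched values must be preserved, because they are used later to reference adjacent columns, and I found match() to be the best fit for that.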

I can do exact string matching:

## exact string matching
name <-  c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none') 
meaning2 <- c('surface','longitudinal','transverse','not detected') 
meaning3 <- c('category 1','category 1','category 1','category 2') 
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
  name      meaning1     meaning2   meaning3
1  AAB      circular      surface category 1
2  AAC      parallel longitudinal category 1
3  AAD perpendicular   transverse category 1
4  AAE          none not detected category 2
> myData 
  name2
1   AAB
2   AAC
3   AAD
4   AAE

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] 1 2 3 4

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
  name2        newCol      newCol2
1   AAB      circular      surface
2   AAC      parallel longitudinal
3   AAD perpendicular   transverse
4   AAE          none not detected

However, the real data is slightly more complex and only matches partially, so my approach above does not work:

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData 
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] NA NA NA NA

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
                    name2 newCol newCol2
1 AAB Monday and Thursday   <NA>    <NA>
2            AAC Saturday   <NA>    <NA>
3           AAD Wednesday   <NA>    <NA>
4              AAE Friday   <NA>    <NA>

Is it possible to combine match() with regular expressions to do partial matching?

EDIT: The reproducible example was oversimplified. Something more representative would be:

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday

1 answer:

Answer 0: (score: 1)

You can use sapply and grep like this:

sapply(referenceData[, 'name'], grep, myData[, 'name2'])

Note that I reversed the order of the arguments. "AAB" as a regular expression matches "AAB Monday and Thursday", but not the other way around.
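With the six-row data from the question's edit, "AAB" occurs in several rows of name2, so the sapply/grep call returns, for each reference name, a vector of matching row numbers rather than one position per row of myData. The following is a minimal sketch of one way to turn that around into a match()-style index vector. It assumes each name2 value contains at most one reference name (if several match, the first one in referenceData wins), and it uses fixed = TRUE so the names are treated as literal substrings rather than regular expressions; both are choices made for this illustration, not part of the original answer.

```r
# Data from the question (six-row version from the EDIT)
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# One slot per row of myData; NA means "no reference name found"
matched <- rep(NA_integer_, nrow(myData))

for (i in seq_along(referenceData$name)) {
  # TRUE for every row of myData whose name2 contains the i-th reference name
  hit <- grepl(referenceData$name[i], myData$name2, fixed = TRUE)
  # Record the reference row only where nothing has been recorded yet
  matched[hit & is.na(matched)] <- i
}

# matched can now be used exactly like the result of match()
myData$newCol  <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Because matched has one entry per row of myData, the lookups into meaning1 and meaning2 work the same way as in the exact-matching example from the question.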

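Edit: Given your edit, if you know that you will always be matching on only the first three characters, you can try this simple approach (no partial matching needed):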

first3 <- substr(myData[ , 'name2'],  1, 3)
match(first3,  referenceData[ ,'name'])
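If the code at the start of name2 is not always exactly three characters long, a similar idea may still apply. The sketch below assumes (this is not stated in the question) that the code is always the first space-delimited token of name2; it strips everything from the first space onward with sub() and then uses plain match(). The variable name firstToken is made up for this illustration.

```r
# Reference names and the six-row data from the question's EDIT
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Assumption: the code is everything before the first space
firstToken <- sub(" .*", "", myData$name2)

# Exact matching on the extracted code keeps one position per row of myData
matched <- match(firstToken, referenceData$name)
myData$newCol <- referenceData$meaning1[matched]
```

Rows whose first token is not present in referenceData$name would get NA in matched, just as with the exact match() in the question.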