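Using the function match(), I want to perform partial string matching between two character vectors that live in different data frames. The position of each matched value must be preserved, because it is used later to reference adjacent columns, and match() seemed the best fit for that.
I can do exact string matching: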
## exact string matching
name <- c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none')
meaning2 <- c('surface','longitudinal','transverse','not detected')
meaning3 <- c('category 1','category 1','category 1','category 2')
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
name meaning1 meaning2 meaning3
1 AAB circular surface category 1
2 AAC parallel longitudinal category 1
3 AAD perpendicular transverse category 1
4 AAE none not detected category 2
> myData
name2
1 AAB
2 AAC
3 AAD
4 AAE
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] 1 2 3 4
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB circular surface
2 AAC parallel longitudinal
3 AAD perpendicular transverse
4 AAE none not detected
However, the real data is slightly more complex and can only be matched partially, so the approach above does not work:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] NA NA NA NA
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB Monday and Thursday <NA> <NA>
2 AAC Saturday <NA> <NA>
3 AAD Wednesday <NA> <NA>
4 AAE Friday <NA> <NA>
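Is it possible to combine match() with regular expressions to do partial matching?
EDIT: The reproducible example was oversimplified. A more representative one would be: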
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday
Answer 0 (score: 1)
You can use sapply and grep like this:
sapply(referenceData[, 'name'], grep, myData[, 'name2'])
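Note that I reversed the order of the arguments. "AAB" as a regular expression matches "AAB Monday and Thursday", but not vice versa.
Edit: Given your edit, if you know that you always match on the first three characters only, you could try this simpler approach (no partial matching needed):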
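With the edited data, the sapply/grep call returns, for each reference name, the rows of myData that contain it, which is not the same shape as the output of match(). A minimal sketch of turning that around so there is one index per row of myData, assuming each value of name2 contains at most one reference name (the first hit wins otherwise); `partial_match` is a hypothetical helper name, not a base R function:

```r
referenceData <- data.frame(
  name = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element of x, return the position of the first
# pattern contained in it, or NA if none is found (same shape as match()).
partial_match <- function(x, patterns) {
  vapply(x, function(s) {
    # fixed = TRUE treats each pattern as a literal substring, not a regex
    hits <- which(vapply(patterns, function(p) grepl(p, s, fixed = TRUE), logical(1)))
    if (length(hits) > 0) hits[[1]] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

matched <- partial_match(myData$name2, referenceData$name)
myData$newCol <- referenceData$meaning1[matched]
```

Here `matched` can be used to index referenceData exactly as in the exact-matching code from the question. Drop `fixed = TRUE` if the reference names are meant to be regular expressions.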
first3 <- substr(myData[ , 'name2'], 1, 3)
match(first3, referenceData[ ,'name'])
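If the key is not always exactly three characters, a variation on the same idea is to cut each string at the first whitespace instead. This is only a sketch and assumes the key is always the first space-delimited token of name2, which the question does not state explicitly:

```r
referenceData <- data.frame(
  name = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Keep only the leading token: remove everything from the first whitespace on
key <- sub("[[:space:]].*$", "", myData$name2)

# From here on it is the exact-matching workflow from the question
matched <- match(key, referenceData$name)
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Because match() returns one position per element of its first argument, repeated keys such as the three "AAB" rows each get their own index.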