Find word matches in R allowing at most n words of separation

Time: 2018-03-05 16:52:55

Tags: r regex

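I am looking for a match of word1 before word2, allowing at most 5 words of separation between word1 and word2. For example, if word1 is apple and word2 is mango, the pattern should match 'apple is a fruit like mango' but not 'mango is a fruit like apple' (word2 before word1) or 'apple and orange are fruits, like mango' (more than 5 words). An example regex in Python is {{1}}. What is a similar pattern and function to identify this pattern in R?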

2 answers:

Answer 0: (score: 1)

#DATA
word1 = "apple"
word2 = "mango"
p1 = "apple is a fruit like mango"
p2 = "apple and orange are fruits, like mango"
p3 = "mango is a fruit like apple"

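#FUNCTION
# Returns the text from word1 through word2 if word1 comes first and at most
# 5 spaces lie between the two word starts; otherwise NA.
foo = function(word1, word2, string){
    # position of the first occurrence of each word (-1 if absent)
    ind1 = regexpr(word1, string, fixed = TRUE)[1]
    ind2 = regexpr(word2, string, fixed = TRUE)[1]
    if(ind1 < 0 || ind2 <= ind1){
        return(NA)
    }
    # count the spaces from the start of word1 to the start of word2
    nwords = nchar(gsub("[^ ]", "", substr(string, ind1, ind2)))
    if(nwords <= 5){
        substr(string, ind1, ind2 + nchar(word2) - 1)
    }else{
        NA
    }
}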

#USAGE
foo(word1, word2, p1)
#[1] "apple is a fruit like mango"

foo(word1, word2, p2)
#[1] NA

foo(word1, word2, p3)
#[1] NA
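
A possible variant of the same idea is to split the string into word tokens and compare token positions, so that punctuation and repeated spaces do not affect the count. This is only a sketch: the helper name `within_n_words` and its parameter `n` are hypothetical, `n` is taken to mean the number of words strictly between word1 and word2, and, like `foo`, it only looks at the first occurrence of each word.

```r
# Hypothetical helper: TRUE if word1 occurs before word2 with at most
# n words strictly between them (first occurrences only).
within_n_words <- function(string, word1, word2, n = 5) {
  # split on runs of non-word characters and drop empty tokens
  tokens <- strsplit(string, "\\W+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  i <- match(word1, tokens)  # index of the first word1 token, NA if absent
  j <- match(word2, tokens)  # index of the first word2 token, NA if absent
  !is.na(i) && !is.na(j) && j > i && (j - i - 1) <= n
}

within_n_words("apple is a fruit like mango", "apple", "mango", n = 4)
within_n_words("apple and orange are fruits, like mango", "apple", "mango", n = 4)
within_n_words("mango is a fruit like apple", "apple", "mango", n = 4)
```

Under this counting, 'apple and orange are fruits, like mango' has exactly 5 words between apple and mango, so `n = 4` is the setting that rejects it, as `foo` does.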

Answer 1: (score: 1)

This works. Counting apple as the first word, this regex searches the next 4 words and matches if it finds mango within the defined word limit.

library(stringi)  # stri_extract_all() is a stringi function, not stringr
> stri <- c('apple is a fruit like mango','apple and orange are fruits, like mango','apple is not a fruit like orange or mango')
> stri_extract_all(str = stri, regex = 'apple(\\s\\w+){1,4}?.mango')

[[1]]
[1] "apple is a fruit like mango"

[[2]]
[1] NA

[[3]]
[1] NA
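
If an extra package is not wanted, roughly the same pattern can be sketched in base R with `regexpr(..., perl = TRUE)` and `regmatches()`. The helper name `match_within` and its parameter `max_between` are hypothetical, and the sketch assumes word1 and word2 are plain words with no regex metacharacters. Here `max_between` is the maximum number of words allowed strictly between the two words.

```r
# Hypothetical helper: for each element of x, return the text from word1
# through word2 when at most max_between words separate them, else NA.
match_within <- function(x, word1, word2, max_between = 4) {
  # word1, a separator, up to max_between "word + separator" groups, then word2
  pattern <- sprintf("\\b%s\\W+(?:\\w+\\W+){0,%d}?%s\\b",
                     word1, max_between, word2)
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() returns only the matches
  out
}

stri <- c("apple is a fruit like mango",
          "apple and orange are fruits, like mango",
          "apple is not a fruit like orange or mango",
          "mango is a fruit like apple")
match_within(stri, "apple", "mango")
```

With `max_between = 4` only the first string should match, which agrees with the output above; raising it to 5 would also accept the second string, since exactly 5 words lie between apple and mango there.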