Question

I have the names of about 7 countries stored in the following vector:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')

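Now I have to find out, using R, whether a given sentence contains any of these words. Sometimes the name of a country is hidden in consecutive letters of the sentence. For example: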

You all must pay it back, or each of you will be in trouble.

If this sentence is passed in, it should return "korea".

I tried:

grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
 fixed = FALSE)

It should return "korea", but it does not. Maybe I should not be using partial matching, but I don't know much about it.

Any help is appreciated.

Answer 1

You can use the handy stringr library. First, lower-case the sentence we want to search and remove all punctuation and whitespace from it.

> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
> g
# [1] "youallmustpayitbackoreachofyouwillbeintrouble"

Then we can use str_detect to look for matches.

> Random[str_detect(g, Random)]
# [1] "korea"

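Basically you are just looking for a substring within a sentence, so collapsing the sentence first seems like a good approach. Alternatively, you can use str_locate and str_sub to extract the matching substring.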

> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"

Edit: here is another one I came up with, using only base R:

> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
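
As a rough sketch, the collapse-then-search steps above can also be wrapped into a small base R helper. The name find_hidden is hypothetical and not part of the original answer, and fixed = TRUE is my own choice so that each word is treated as a literal substring rather than a regular expression.

```r
Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')

# Hypothetical helper (not from the original answer): returns the words
# that occur in the consecutive letters of a sentence.
find_hidden <- function(sentence, words) {
  # Lower-case the sentence and drop everything that is not a letter a-z
  collapsed <- gsub("[^a-z]", "", tolower(sentence))
  # Test each word as a literal substring of the collapsed sentence
  hits <- vapply(words, grepl, logical(1), x = collapsed, fixed = TRUE)
  words[hits]
}

find_hidden("You all must pay it back, or each of you will be in trouble.", Random)
# [1] "korea"
```

Lower-casing inside the helper means a match cannot be missed just because part of the hidden word is capitalised in the sentence.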

Answer 2

Using only base functions:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2 <- paste(Random, collapse = "|")     # build the alternation pattern norway|india|...

text <- "back, or each of you will be in trouble."
text2 <- gsub("[[:punct:][:space:]]", "", text, perl = TRUE)  # remove punctuation and whitespace

regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
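
A small sketch of the same idea on a longer sentence, assuming you also want to ignore capital letters. The tolower() call, the example sentence, and the names pattern, collapsed and found are my additions, not part of the answer above. Because the pattern is an alternation, gregexpr returns every non-overlapping hit in order of appearance.

```r
Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
pattern <- paste(Random, collapse = "|")

# Illustrative sentence (an assumption, not from the question)
sentence <- "Pay it back, or each of you is in diapers."

# Lower-case first so capital letters cannot hide a match, then collapse
collapsed <- gsub("[[:punct:][:space:]]", "", tolower(sentence))

# All non-overlapping matches, in the order they appear in the sentence
found <- regmatches(collapsed, gregexpr(pattern, collapsed))[[1]]
found
# [1] "korea" "india"
```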

Answer 3

You can use stringi, which is faster for these operations:

library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"

#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')    
txt <- "You all must pay it back, or each of you will be in trouble."

Answer 4

Try:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')    
txt <- "You all must pay it back, or each of you will be in trouble."

tt  <- gsub("[[:punct:]]|\\s+", "", txt)

unlist(sapply(Random, function(r) grep(r, tt)))
korea 
    1
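
The output above is a named vector: the value is the match index and the name is the country. If only the name is wanted, one possible follow-up is to take names() of that result. This is just a sketch, and hits is a made-up variable name.

```r
Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
txt <- "You all must pay it back, or each of you will be in trouble."

# Same collapsing step as above: drop punctuation and whitespace
tt <- gsub("[[:punct:]]|\\s+", "", txt)

# grep() gives 1 where the country occurs in tt and integer(0) otherwise;
# unlist() drops the empty results, leaving only the matching names
hits <- names(unlist(sapply(Random, function(r) grep(r, tt))))
hits
# [1] "korea"
```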

Matching strings hidden in consecutive letters of a sentence using R
