Question

I am looking for a function (let's call it scramblematch) that does the following.

query='one five six'
target1='one six two six three four five '
target2=' two five six'

scramblematch(query, target1) returns TRUE and

scramblematch(query, target2) returns FALSE.

The stringdist package may be what I need, but I don't know how to use it.

UPDATE1

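The use case for the function I am looking for: I have a dataset whose data was entered gradually over many years. One text field of the dataset (textfield) was not standardized, so people entered the same thing in different ways. I now want to clean this data using a set of standard values for textfield. Every value that describes the same thing in different wording should be replaced by the standardized value. For example (I am making this up):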

In my set of standardized values (let's call it lookupfactors), I have lookupfactors=c('liver disease', 'and more'). In textfield I have the following rows:

liver cancer disease
some other thing
male, liver fibrosis disease
yet another thing
failure of liver, disease

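In the final result, I want rows 1, 3 and 5 (because they contain both 'liver' and 'disease') to be replaced by liver disease. Here I am assuming that the people who entered the data did not know the exact term, but they knew the keywords to put in it. So the words of the liver disease value in lookupfactors are …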

Answer 1

One option for implementing it is to use %in% and strsplit:

scramblematch <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
scramblematch(query, target1)
#[1] TRUE
scramblematch(query, target2)
#[1] FALSE

A vectorized approach using stringi could be:

library(stringi)
scramblematch <- function(query, target, sep = " ") {
  q <- stri_split_fixed(query, sep)[[1L]]
  sapply(stri_split_fixed(target, sep), function(x) {
    all(q %in% x)
  })
}

scramblematch(query, c(target1, target2))
#[1]  TRUE FALSE
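
Applied to the use case in the question's update, the same idea could be sketched as follows. This is only an illustration under stated assumptions: contains_all_words is a hypothetical helper name, base R is used instead of stringi, and the targets are split on any run of non-alphanumeric characters (not just a space) so that "liver," still counts as the word "liver". Matching is case-sensitive, and if a row matched several standard values the last one in lookupfactors would win.

```r
# Hypothetical helper (not from the original answer): for each element of
# `target`, check whether it contains every word of `query`.
# Assumption: words are separated by runs of non-alphanumeric characters.
contains_all_words <- function(query, target) {
  q <- strsplit(query, "[^[:alnum:]]+")[[1]]
  vapply(strsplit(target, "[^[:alnum:]]+"),
         function(words) all(q %in% words),
         logical(1))
}

lookupfactors <- c("liver disease", "and more")
textfield <- c("liver cancer disease",
               "some other thing",
               "male, liver fibrosis disease",
               "yet another thing",
               "failure of liver, disease")

# Replace every row that contains all words of a standard value.
cleaned <- textfield
for (f in lookupfactors) {
  cleaned[contains_all_words(f, textfield)] <- f
}
cleaned
#[1] "liver disease"     "some other thing"  "liver disease"
#[4] "yet another thing" "liver disease"
```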

Answer 2

You can try (the fixed=TRUE improvement comes from @David's comment):

scramblematch<-function(query,target) {
   Reduce("&",lapply(strsplit(query," ")[[1]],grepl,target,fixed=TRUE))
}
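
One caveat: grepl with fixed=TRUE looks for substrings, so a query word such as "six" would also be found inside "sixty". If whole-word matching is wanted, a possible variant (a sketch only; scramblematch_words is a hypothetical name, and it assumes the query words contain no regex metacharacters) wraps each word in word boundaries:

```r
# Substring behaviour of the fixed = TRUE version:
grepl("six", "sixty", fixed = TRUE)
#[1] TRUE

# Hypothetical whole-word variant: each query word must appear as a
# separate word in the target. Assumes query words are plain
# alphanumeric strings (no regex special characters).
scramblematch_words <- function(query, target) {
  words <- strsplit(query, " ", fixed = TRUE)[[1]]
  patterns <- paste0("\\b", words, "\\b")
  # One logical vector per word, combined element-wise with `&`.
  Reduce(`&`, lapply(patterns, grepl, x = target, perl = TRUE))
}

scramblematch_words("one six", c("one sixty", "six one"))
#[1] FALSE  TRUE
```

This keeps the vectorization over target, at the cost of regex matching being slower than fixed matching.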

Some benchmarks:

query='one five six'
target1='one six two six three four five '
target2=' two five six'
target<-rep(c(target1,target2),10000)
system.time(scramblematch(query,target))   
# user  system elapsed 
#0.008   0.000   0.008
scramblematchDD <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
system.time(vapply(target,scramblematchDD,query=query,TRUE))   
# user  system elapsed 
#0.657   0.000   0.658

The vapply is needed for @docendodiscimus's solution because it is not vectorized.

R: Tell whether substrings are present in another string
