Question

I want to search a set of strings for a specific pattern.

Given these two character vectors:

actions <- c("taking","using")

nouns <- c("medication","prescription")

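I want to find any combination of action + noun, in that specific order, and not noun + action. For example, I want to detect combinations such as:

using medication
taking medication
using prescription

Using the following text: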

phrases <- c("he was using medication",
              "medication using it",
              "finding medication",
              "taking the left",
              "using prescription medication",
              "taking medication drug")

I tried grep("\\b(taking|using+medication|prescriptio)\\b",phrases,value = FALSE), but this is obviously wrong.

Answer 1

You can build alternation groups from the values of actions and nouns, and put them into a larger regular expression:

actions <- c("taking","using")
nouns <- c("medication","prescription")
phrases <- c("he was using medication","medication using it","finding medication","taking the left","using prescription medication","taking medication drug")
grep(paste0("(",paste(actions, collapse="|"), ")\\s+(", paste(nouns,collapse="|"),")"), phrases, value=FALSE)
## => [1] 1 5 6
## and a visual check
grep(paste0("(",paste(actions, collapse="|"), ")\\s+(", paste(nouns,collapse="|"),")"), phrases, value=TRUE)
## => [1] "he was using medication" "using prescription medication" "taking medication drug"

See the online R demo.

The resulting regex looks like

(taking|using)\s+(medication|prescription)

See the regex demo.
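As a small sketch of the same idea, the pattern can be built once, stored in a variable, and inspected before use. The variable names `pattern` and `hits` are illustrative and do not come from the original answer; `grepl()` is used here only to show the per-phrase result as a logical vector.

```r
actions <- c("taking", "using")
nouns   <- c("medication", "prescription")
phrases <- c("he was using medication",
             "medication using it",
             "finding medication",
             "taking the left",
             "using prescription medication",
             "taking medication drug")

# Build the pattern once: (action1|action2)\s+(noun1|noun2)
pattern <- paste0("(", paste(actions, collapse = "|"), ")\\s+(",
                  paste(nouns, collapse = "|"), ")")

# Print the regex as the regex engine sees it (single backslash)
cat(pattern, "\n")

# One TRUE/FALSE per phrase instead of the matching indices
hits <- grepl(pattern, phrases)
print(hits)
## => [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
```

Keeping the pattern in a variable avoids repeating the `paste0()` call for the `value=FALSE` and `value=TRUE` variants.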

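Details:

(taking|using) - an alternation group matching either taking or using (| is the alternation operator)
\s+ - one or more whitespace characters
(medication|prescription) - an alternation group matching either medication or prescription.

Note that the (...) capturing groups may be replaced with (?:...) non-capturing groups to avoid keeping the submatches in memory.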
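A minimal sketch of the non-capturing variant follows. It also adds \b word boundaries, which the question's own attempt used but which are not part of the answer's pattern; that addition is an assumption about what the asker wants, since without it a word such as "refusing" would match because it ends in "using". `perl = TRUE` is passed so the pattern is handled by the PCRE engine, and the names `pattern_plain` and `pattern_nc` are illustrative.

```r
actions <- c("taking", "using")
nouns   <- c("medication", "prescription")
phrases <- c("he was using medication",
             "medication using it",
             "finding medication",
             "taking the left",
             "using prescription medication",
             "taking medication drug")

# Pattern from the answer: capturing groups, no word boundaries
pattern_plain <- paste0("(", paste(actions, collapse = "|"), ")\\s+(",
                        paste(nouns, collapse = "|"), ")")

# Variant (assumption): non-capturing groups plus \b on both sides
pattern_nc <- paste0("\\b(?:", paste(actions, collapse = "|"), ")\\s+(?:",
                     paste(nouns, collapse = "|"), ")\\b")

grep(pattern_nc, phrases, perl = TRUE)
## => [1] 1 5 6

# "refusing" ends in "using", so the unanchored pattern matches it
grepl(pattern_plain, "refusing medication")
## => [1] TRUE
grepl(pattern_nc, "refusing medication", perl = TRUE)
## => [1] FALSE
```

Whether the boundaries are wanted depends on the data; for the sample phrases both patterns return the same indices.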
