Question

I am trying to write a function that extracts every substring of a string that matches a regular expression. For example:

str <- "hello Brother How are you"

I want to extract all substrings of str that match this regular expression: "[A-z]+ [A-z]+"

which should give:

"hello Brother"
"Brother How"
"How are"
"are you"

Is there a library function that can do this?

Answer 1

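You can use the str_match_all function from the stringr library together with the approach Tim Pietzcker describes in his answer (capturing inside an unanchored positive lookahead):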

> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How"   "How are"       "are you"

Or, to get only the unique values:

> unique(l[l != ""])
## [1] "hello Brother" "Brother How"   "How are"       "are you"

I suggest using [[:alpha:]] instead of [A-z], since [A-z] matches more than just letters.
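To see why [A-z] is broader than it looks, here is a small check. It is only a sketch: it assumes perl = TRUE so that the range is evaluated by ASCII code point, and the sample characters are made up for illustration.

```r
# Hypothetical test characters: two letters, three punctuation marks, one digit
chars <- c("a", "Z", "[", "_", "^", "1")

# With perl = TRUE the range A-z covers ASCII codes 65 to 122,
# which includes [ \ ] ^ _ ` as well as the letters
in_range <- grepl("^[A-z]$", chars, perl = TRUE)

# The POSIX class only accepts alphabetic characters
is_alpha <- grepl("^[[:alpha:]]$", chars, perl = TRUE)

print(data.frame(chars, in_range, is_alpha))
```

Here in_range is TRUE for everything except "1", while is_alpha is TRUE only for "a" and "Z".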

Answer 2

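Regex matches "consume" the text they match, so (usually) the same bit of text can't be matched twice. But there are constructs called lookaround assertions that don't consume the text they match, and they may contain capturing groups.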
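The "consuming" behaviour is easy to observe with an ordinary match on the string from the question. This is a minimal sketch using base R's gregexpr and regmatches; the variable name consumed is just for illustration.

```r
str <- "hello Brother How are you"

# An ordinary (consuming) match: once "hello Brother" has matched,
# the search resumes after it, so "Brother How" is never found
m <- gregexpr("[A-Za-z]+ [A-Za-z]+", str, perl = TRUE)
consumed <- regmatches(str, m)[[1]]

print(consumed)
```

Only "hello Brother" and "How are" come back; the overlapping pairs are skipped because their first word has already been consumed.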

Lookaround makes what you are attempting possible (although you can't use [A-z], which doesn't mean what you think it does):

(?=\b([A-Za-z]+ [A-Za-z]+))

will match as expected; you need to look at group 1 of each match result, not the matched text itself (which is always empty).

The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you would also get results such as "ello Brother", "llo Brother" and "lo Brother").

Test it live on regex101.com.
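For completeness, the same lookahead idea can be wrapped in a small base R helper. This is a sketch rather than a definitive implementation: the function name extract_overlapping is hypothetical, and it relies on gregexpr with perl = TRUE exposing the "capture.start" and "capture.length" attributes for the capturing group.

```r
# Hypothetical helper: returns all (possibly overlapping) substrings of x
# that match `pattern` and start at a word boundary
extract_overlapping <- function(x, pattern) {
  # Wrap the pattern in a lookahead so nothing is consumed;
  # the inner parentheses become capturing group 1
  lookahead <- paste0("(?=\\b(", pattern, "))")
  m <- gregexpr(lookahead, x, perl = TRUE)[[1]]

  # gregexpr signals "no match" with -1
  if (m[1] == -1) return(character(0))

  # The match itself is empty, so read the text from group 1 instead
  starts <- attr(m, "capture.start")[, 1]
  lens <- attr(m, "capture.length")[, 1]
  unname(substring(x, starts, starts + lens - 1))
}

pairs <- extract_overlapping("hello Brother How are you", "[A-Za-z]+ [A-Za-z]+")
print(pairs)
```

On the string from the question this yields "hello Brother", "Brother How", "How are" and "are you", without needing stringr.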
