Question

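How can I write a regular expression that captures an uppercase letter together with any characters that follow it, up to the next space?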

Input: cake pietypeAPPLE CRUMBLE tart toastTexas price

For example, I want to capture "APPLE" even though there is no space before it. I want "CRUMBLE". I also want "Texas", even though not all of its letters are uppercase.

I will use gsub(pattern, replacement = "", x = string) to get the following output:

Output: cake pietype tart toast price

Thanks!

Answer 1

You can use regmatches to extract these substrings.

> x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'
> regmatches(x, gregexpr('[A-Z]\\S+', x))[[1]]
# [1] "APPLE"   "CRUMBLE" "Texas"

Or, if you want to match strictly alphabetic characters:

> regmatches(x, gregexpr('[A-Z][A-Za-z]+', x))[[1]]
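
The two patterns agree on the sample string but diverge once punctuation is attached to a word. A minimal sketch, assuming a made-up input `y` (not from the question) that puts a comma directly after "Texas":

```r
# Hypothetical input: a comma directly follows the capitalised word.
y <- 'toastTexas, price'

# \\S+ keeps consuming any non-space character, so the comma is included.
non_space <- regmatches(y, gregexpr('[A-Z]\\S+', y))[[1]]

# [A-Za-z]+ stops at the first non-letter, so the comma is left out.
letters_only <- regmatches(y, gregexpr('[A-Z][A-Za-z]+', y))[[1]]

print(non_space)     # "Texas,"
print(letters_only)  # "Texas"
```

Both patterns use `+`, so they need at least one character after the capital letter; a lone capital such as "A" would not be matched by either.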

If you want to replace them, I would use the following to avoid leaving extra spaces between words.

> gsub('[A-Z][A-Za-z]+( [A-Z][A-Za-z]+)*', '', x)
# [1] "cake pietype tart toast price"
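
To see why the optional group matters, it may help to compare it with the simpler single-word pattern. This is only a sketch on the same `x`; the variable names `naive` and `grouped` are illustrative.

```r
x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'

# Removing each capitalised word separately leaves the space that sat
# between APPLE and CRUMBLE, so two spaces end up before "tart".
naive <- gsub('[A-Z][A-Za-z]+', '', x)

# Matching a run of space-separated capitalised words consumes the
# inner space as well.
grouped <- gsub('[A-Z][A-Za-z]+( [A-Z][A-Za-z]+)*', '', x)

print(naive)    # "cake pietype  tart toast price"
print(grouped)  # "cake pietype tart toast price"
```

The grouped pattern only absorbs spaces between adjacent capitalised words. A capitalised word standing alone between two lowercase words would still leave a double space, so collapsing whitespace afterwards may be needed for other inputs.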

Answer 2

Here is an approach using the qdapRegex package:

x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'

library(qdapRegex)
rm_default(x, pattern="[A-Z][A-Za-z]*")

## [1] "cake pietype tart toast price"

If you want to extract these terms instead:

rm_default(x, pattern="[A-Z][A-Za-z]*", extract=TRUE)

## [[1]]
## [1] "APPLE"   "CRUMBLE" "Texas"
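
For readers who prefer not to install a package, the same two operations can be approximated in base R. This is a rough sketch, not a reimplementation of qdapRegex: the helper name `strip_or_extract` and its arguments are hypothetical, and the whitespace cleanup is an assumption about what is wanted here.

```r
# Hypothetical helper, not part of qdapRegex: remove or extract
# substrings that start with an uppercase letter.
strip_or_extract <- function(x, pattern = "[A-Z][A-Za-z]*", extract = FALSE) {
  if (extract) {
    # One character vector of matches per input string.
    return(regmatches(x, gregexpr(pattern, x)))
  }
  out <- gsub(pattern, "", x)    # drop the matches
  out <- gsub("\\s+", " ", out)  # collapse runs of whitespace
  trimws(out)                    # drop leading/trailing whitespace
}

x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'

removed   <- strip_or_extract(x)
extracted <- strip_or_extract(x, extract = TRUE)

print(removed)         # "cake pietype tart toast price"
print(extracted[[1]])  # "APPLE" "CRUMBLE" "Texas"
```

Because this pattern uses `*` rather than `+`, a single capital letter on its own also counts as a match.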

Regular expression condition until the next space
