Question

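I'm new to programming in R and am trying to extract alphanumeric words as well as words containing more than one uppercase letter.

Below is an example of the strings and my desired output.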

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS"  
    #[2] "H123a1 ab-cd1"  "al1234 al5678"

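So far, I've come across two separate solutions on Stack Overflow:

- Extract all words that contain a number -> this doesn't help me, because the column I'm applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B"
- Identify strings with multiple capital/uppercase letters -> Pierre Lafortune provided the following solution: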

how-to-count-capslock-in-string-using-r

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b")

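His code returns how many times a string contains a run of more than one uppercase letter, but I want to extract those words themselves, in addition to extracting the alphanumeric words.

Forgive me if my question or research isn't comprehensive enough. I'll post my researched solution for extracting all words containing a number within 12 hours, once I have access to a workstation with R and the dataset.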

Answer 1

This works:

    library(stringr)

    # split words from strings into one-word-per element vector
    y <- unlist(str_split(x, ' '))

    # find strings with at least 2 uppercase
    uppers <- str_count(y, '[A-Z]')>1

    # find strings with at least 1 letter
    alphas <- str_detect(y, '[:alpha:]')

    # find strings with at least 1 number
    nums <- str_detect(y, '[:digit:]')

    # subset vector to those that have 2 uppercase OR a letter AND a number
    y[uppers | (alphas & nums)]

     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"
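The code above flattens everything into one vector, whereas the question asks for one joined string per input element. A minimal base R sketch of the same three tests that keeps the per-string grouping could look like this; `keep_tokens` is a hypothetical helper name introduced only for illustration, and splitting on a single space is an assumption carried over from the answer.

```r
# Sample input copied from the question
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# keep_tokens() is a hypothetical helper, not part of any package.
# It applies the same three tests as the answer to one string at a time.
keep_tokens <- function(s) {
  # split the string into words on single spaces
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # count uppercase ASCII letters in each word
  n_upper <- lengths(regmatches(words, gregexpr("[A-Z]", words, perl = TRUE)))
  # flag words with at least one letter / at least one digit
  has_alpha <- grepl("[[:alpha:]]", words)
  has_digit <- grepl("[[:digit:]]", words)
  # keep words with 2+ uppercase letters OR both a letter and a digit
  paste(words[n_upper > 1 | (has_alpha & has_digit)], collapse = " ")
}

result <- vapply(x, keep_tokens, character(1), USE.NAMES = FALSE)
print(result)
```

Because each input string is processed separately, a string with no qualifying words yields an empty string rather than disappearing from the result.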

Answer 2

A single-regex solution also works:

    > res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
    > unlist(res)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

The same pattern also works with regmatches in base R, using the PCRE regex engine:

    > res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
    > unlist(res2)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

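Why does this work?

(?<!\\S) - finds a position right after whitespace or at the start of the string
(?: - start of a non-capturing group that defines two alternative patterns:
- (?=\\S*\\p{L})(?=\\S*\\d)\\S+
  - (?=\\S*\\p{L}) - requires a letter after 0+ non-whitespace characters (for better performance, replace \\S* with [^\\s\\p{L}]*)
  - (?=\\S*\\d) - requires a digit after 0+ non-whitespace characters (for better performance, replace \\S* with [^\\s\\d]*)
  - \\S+ - matches 1 or more non-whitespace characters
- | - or
- (?:\\S*\\p{Lu}){2}\\S*:
  - (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace characters (\\S*; for better performance, replace it with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
  - \\S* - 0+ non-whitespace characters
) - end of the non-capturing group.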
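The breakdown above suggests swapping each \\S* for a narrower negated class. As a sketch only, the snippet below builds that tightened pattern and checks, on the question's sample data, that it extracts the same words as the original. The names `pat_orig`, `pat_tight`, and `extract` are introduced here for illustration, and no timing is measured, so any performance gain remains the answer's claim rather than something this example demonstrates.

```r
# Sample input copied from the question
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Pattern exactly as given in the answer
pat_orig <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# Same pattern with each \\S* replaced by the negated class the
# explanation recommends (illustrative variable name)
pat_tight <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"

# extract() is a hypothetical wrapper around the base R calls from the answer
extract <- function(pattern, strings) {
  unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))
}

words_orig  <- extract(pat_orig, x)
words_tight <- extract(pat_tight, x)

# TRUE when both patterns extract the same words in the same order
print(identical(words_orig, words_tight))
```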

To join the matches that belong to each element of the character vector, you can use

    unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))

See the online R demo.

Output:

    [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"        
    [4] "H123a1 ab-cd1"        "al1234 al5678"
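As an alternative sketch of the joining step, `vapply` can replace the `unlist(lapply(...))` combination and guarantees one string per input element. The example below uses the base R `regmatches` route from this answer and appends one extra input without matches (an addition made here purely to show the edge case); `joined` is an illustrative name.

```r
# Sample input from the question, plus one string with no matches
# (the last element is an assumption added to show the empty case)
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678",
       "no matches here")

pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# One character vector of matches per input string
res <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Collapse each vector into a single space-separated string;
# vapply() enforces exactly one character value per element
joined <- vapply(res, paste, character(1), collapse = " ")
print(joined)
```

An input with no matches comes back as an empty string, so `joined` always has the same length as `x` and can be stored as a new column next to the original one.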
