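I am new to R programming and want to extract alphanumeric words, as well as words that contain more than one uppercase letter.
Below is an example of the strings and my desired output.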
x <- c("123AB123 Electrical CDe FG123-4 ...",
"12/1/17 ABCD How are you today A123B",
"20.9.12 Eat / Drink XY1234 for PQRS1",
"Going home H123a1 ab-cd1",
"Change channel for al1234 to al5678")
#Desired Output
#[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS"
#[2] "H123a1 ab-cd1" "al1234 al5678"
So far, I have come across two separate solutions on Stack Overflow:
how-to-count-capslock-in-string-using-r
library(stringr)
str_count(x, "\\b[A-Z]{2,}\\b")
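That code gives the number of times a string contains a run of more than one uppercase letter, but I want to extract those words, and extract the alphanumeric words as well.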
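To illustrate the gap, here is a minimal base-R sketch (not from the linked answer) that extracts, rather than counts, what that pattern matches on the sample vector. Using regmatches and gregexpr with perl = TRUE in place of stringr is an assumption made only to keep the sketch dependency-free.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Extract every run of 2+ uppercase letters delimited by word boundaries.
caps_runs <- regmatches(x, gregexpr("\\b[A-Z]{2,}\\b", x, perl = TRUE))

# Number of matches per input string (what str_count reports).
counts <- lengths(caps_runs)
counts

# The matched text itself.
unlist(caps_runs)
```

On this data only "ABCD" qualifies, because digits count as word characters and so there is no word boundary inside tokens such as 123AB123 or XY1234.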
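Please forgive me if my question or research is not comprehensive. Within 12 hours, once I have access to a workstation with R and the dataset, I will post the solution I researched for extracting all words that contain numbers.
Answer 0 (score: 2)
This works: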
library(stringr)
# split words from strings into one-word-per element vector
y <- unlist(str_split(x, ' '))
# find strings with at least 2 uppercase
uppers <- str_count(y, '[A-Z]')>1
# find strings with at least 1 letter
alphas <- str_detect(y, '[:alpha:]')
# find strings with at least 1 number
nums <- str_detect(y, '[:digit:]')
# subset vector to those that have 2 uppercase OR a letter AND a number
y[uppers | (alphas & nums)]
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
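The result above is one flat vector, so the grouping by input string is lost. A minimal sketch of one way to keep it, applying the same three tests word by word and pasting the survivors back together per string. The helper name keep_word and the use of base R instead of stringr are assumptions for illustration, not part of the answer.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: TRUE for words with 2+ uppercase letters,
# or with at least one letter AND at least one digit.
keep_word <- function(w) {
  uppers <- lengths(regmatches(w, gregexpr("[A-Z]", w, perl = TRUE))) > 1
  alphas <- grepl("[[:alpha:]]", w)
  nums   <- grepl("[[:digit:]]", w)
  uppers | (alphas & nums)
}

# Split each string on single spaces, filter its words, and rejoin them.
words_per_string <- strsplit(x, " ", fixed = TRUE)
result <- vapply(words_per_string,
                 function(w) paste(w[keep_word(w)], collapse = " "),
                 character(1))
result
```

This yields one element per input string, which is the shape of the desired output in the question.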
Answer 1 (score: 1)
A single-regex solution also works:
> res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
> unlist(res)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
This also works in base R with regmatches, using the PCRE regex engine:
> res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
> unlist(res2)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
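Why does this work?
(?<!\\S) - matches a position at the start of the string or right after a whitespace character
(?: - starts a non-capturing group that defines two alternative patterns:
  (?=\\S*\\p{L})(?=\\S*\\d)\\S+
    (?=\\S*\\p{L}) - ensures there is a letter after 0+ non-whitespace characters (for better performance, replace \\S* with [^\\s\\p{L}]*)
    (?=\\S*\\d) - ensures there is a digit after 0+ non-whitespace characters (for better performance, replace \\S* with [^\\s\\d]*)
    \\S+ - matches 1 or more non-whitespace characters
  | - or
  (?:\\S*\\p{Lu}){2}\\S*
    (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace characters (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
    \\S* - 0+ non-whitespace characters
) - end of the non-capturing group.
To join the matches found for each element of the character vector, you can use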
unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))
Output:
[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS1"
[4] "H123a1 ab-cd1" "al1234 al5678"
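The explanation above suggests swapping each \\S* for a negated character class to improve performance. A minimal sketch of that variant in base R, checked against the original pattern on the sample data. The helper name extract_joined is hypothetical, and no timing claim is made here, only that both patterns return the same matches for these inputs.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Pattern from the answer, unchanged.
original <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# Same pattern with the negated classes suggested in the explanation.
tuned <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"

# Hypothetical helper: extract all matches per string and join them with spaces.
extract_joined <- function(pattern, strings) {
  matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
  vapply(matches, paste, character(1), collapse = " ")
}

joined_original <- extract_joined(original, x)
joined_tuned <- extract_joined(tuned, x)

# Both patterns should agree on this data.
identical(joined_original, joined_tuned)
joined_tuned
```

The negated classes stop each scan at the first letter, digit, or uppercase letter instead of letting \\S* run to the end of the word and backtrack, which is where any saving would come from.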