Say I have the following character vector of alphanumeric elements, each containing a state abbreviation somewhere in the element:
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")
How can I extract the letters, excluding NA? That is,
somefunction(strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"
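I have used regular expressions before to strip all non-digits from each element, but never to strip all digits plus only the letters N and A.
Here is what I tried, but it did not work: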
sub(paste(LETTERS[c(2:13,15:26)], collapse = "|"), "", strings, fixed = TRUE)
Answer 0 (score: 2)
A simple solution:
gsub("\\d+|NA", "", strings)
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"
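As for why the attempt in the question had no effect: with fixed = TRUE the pattern is taken as a literal string, so sub() searches for the text "B|C|D|..." itself rather than for any one of those letters. A minimal sketch of that, together with the gsub() call above wrapped in a helper (the name extract_state is only for illustration, not something from the question):

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# With fixed = TRUE, "|" is not alternation; the whole pattern is one literal string
pattern <- paste(LETTERS[c(2:13, 15:26)], collapse = "|")
unchanged <- sub(pattern, "", strings, fixed = TRUE)
identical(unchanged, strings)  # nothing matched, so the input comes back as-is

# Hypothetical helper: remove runs of digits and the literal text "NA"
extract_state <- function(x) gsub("\\d+|NA", "", x)
extract_state(strings)
```

Note that this removes every "NA" it finds, so it assumes each element holds a single abbreviation, as in the sample data.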
Answer 1 (score: 1)
This can be done with lookarounds.
# (?i)(?:(?!na|(?<=n)(?=a))[a-z])+
(?i) # Case insensitive modifier (or use as regex flag)
(?: # Cluster group
(?! # Negative assertion
na # Not NA ahead
| (?<= n ) # Not N behind,
(?= a ) # and A ahead (at this location)
) # End Negative assertion
[a-z] # Safe, grab this single character
)+ # End Cluster group, do 1 to many times
This matches only "AZ" "CA" "CT" "CT" "ID" "VA".
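One way to apply that pattern in R, as a sketch: lookarounds need the PCRE engine, so perl = TRUE is required, and regmatches() with regexpr() pulls out the first match in each element. The variable names here are just for illustration.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Compact form of the lookaround pattern; backslashes are not needed here
pattern <- "(?i)(?:(?!na|(?<=n)(?=a))[a-z])+"

# perl = TRUE switches to PCRE, which supports lookahead and lookbehind
m <- regexpr(pattern, strings, perl = TRUE)
states <- regmatches(strings, m)
states
```

Be aware that regmatches() drops elements with no match, so the result can be shorter than the input if some strings contain no usable letters.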
Answer 2 (score: 1)
The state datasets are available by default. Take a look:
?state
sts <- paste(state.abb,collapse="|")
sub(paste0( "(.+)(", sts, ")(.+)"), "\\2", strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"
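A variation on the same idea, as a sketch: instead of replacing everything around the abbreviation, extract the match directly. This does not require characters on both sides of the abbreviation, but it does assume the first state abbreviation found in each element is the one wanted, which holds for the sample data.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# state.abb ships with base R (datasets package); "NA" is not among its entries
sts <- paste(state.abb, collapse = "|")

# regexpr() locates the first match per element; regmatches() extracts it
found <- regmatches(strings, regexpr(sts, strings))
found
```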
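Someone tried to edit this to call dput(state.abb) and then paste the result into a new assignment. Given that state is always available, that is entirely unnecessary, hence my rejection. The only value I can see would be in nudging people to actually read the help page and in showing what state.abb looks like: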
?state
dput(state.abb)
#c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
... snipped the rest.
Answer 3 (score: 1)
If the state abbreviation is always followed by exactly three characters:
strings.stripped <- sub("^.*([A-Z]{2}).{3}$", "\\1", strings)