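Regular expressions in R: how to extract certain letters from the alphanumeric elements of a character vector?

Time: 2014-11-19 21:41:02

Tags: regex r

Say I have the following character vector of alphanumeric elements, each of which contains a state abbreviation somewhere in the element: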

strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

How can I extract the letters, excluding NA? That is,

somefunction(strings)
[1] "AZ"  "CA"  "CT"  "CT"  "ID"  "VA"

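I have used regular expressions before to remove all non-digits from each element, but never to remove all digits plus only the letters N and A.

Here is what I tried, but it did not work: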

 sub(paste(LETTERS[c(2:13,15:26)], collapse = "|"), "", strings, fixed = TRUE)
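
A likely reason the attempt above fails: with fixed = TRUE the pattern is taken as a literal string, so the | characters are not treated as alternation and the text "B|C|D|..." never occurs in the input. The sketch below checks this. It also shows that without fixed = TRUE, sub() would delete the first letter other than A or N rather than keep the letters. The names literal_result and regex_result are introduced here only for illustration.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Every capital letter except A (position 1) and N (position 14), joined by "|"
pattern <- paste(LETTERS[c(2:13, 15:26)], collapse = "|")

# fixed = TRUE: the whole pattern is matched literally, so nothing is replaced
literal_result <- sub(pattern, "", strings, fixed = TRUE)
print(identical(literal_result, strings))

# As a regular expression, sub() removes only the first matching letter
regex_result <- sub(pattern, "", strings)
print(regex_result)
```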

4 Answers:

Answer 0 (score: 2)

A simple solution:

gsub("\\d+|NA", "", strings)
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"

Answer 1 (score: 1)

This can be done with lookarounds.

 # (?i)(?:(?!na|(?<=n)(?=a))[a-z])+

 (?i)           # Case insensitive modifier (or use as regex flag)
 (?:            # Cluster group
      (?!            # Negative assertion
           na             # Not NA ahead
        |  (?<= n )       # Not N behind,
           (?= a )        # and A ahead (at this location) 
      )              # End Negative assertion
      [a-z]          # Safe, grab this single character
 )+             # End Cluster group, do 1 to many times

It matches only these: "AZ" "CA" "CT" "CT" "ID" "VA"
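
The answer gives only the pattern, so here is a minimal sketch of one way to apply it in R. It assumes the PCRE engine (perl = TRUE), which the lookbehind needs, and uses regexpr() with regmatches() to pull out the first match per element. The variable names are illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# The pattern from this answer, written as a single R string
pattern <- "(?i)(?:(?!na|(?<=n)(?=a))[a-z])+"

# regexpr() locates the first match in each element; regmatches() extracts it
first_match <- regmatches(strings, regexpr(pattern, strings, perl = TRUE))
print(first_match)
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input if some strings contain no qualifying letters.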

Answer 2 (score: 1)

The state datasets are available by default. Take a look at:

 ?state

sts <- paste(state.abb,collapse="|")

sub(paste0( "(.+)(", sts, ")(.+)"), "\\2", strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"
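
One caveat with the sub() call above is that (.+) on both sides requires at least one character before and after the abbreviation. As a hedged alternative sketch (not part of the original answer), the same alternation can be used with regexpr() and regmatches() to extract the first state abbreviation directly, wherever it sits in the string. The name abbs is illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# state.abb ships with R's datasets package, which is attached by default
sts <- paste(state.abb, collapse = "|")

# Extract the first substring that is a valid state abbreviation
abbs <- regmatches(strings, regexpr(sts, strings))
print(abbs)
```

This works here because "NA" is not a state abbreviation. It would pick the wrong token if some other part of a string happened to spell a state code first.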

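Someone tried to edit this to call dput(state.abb) and then paste the output in as a new assignment. Given that state is always available, that is entirely unnecessary, hence my rejection. The only value I can see might be in suggesting that people actually look at the help page, and in illustrating what state.abb looks like: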

?state
dput(state.abb)
#c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", 
# ... snipped the rest.

Answer 3 (score: 1)

If the state abbreviation is always followed by exactly three characters:

# Anchor at the start so the whole string is replaced by the captured pair
strings.stripped <- sub("^.*([A-Z]{2}).{3}$", "\\1", strings)
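
Because this answer relies on the abbreviation sitting at a fixed offset from the end of each string, the same idea can be sketched without a regular expression by using substr() and nchar(). This is an illustrative alternative under the same assumption (exactly three trailing characters), and by_position is a name made up for this example.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Take the two characters that end three positions before the end of each string
n <- nchar(strings)
by_position <- substr(strings, n - 4, n - 3)
print(by_position)
```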