Question

Say I have the following character vector of alphanumeric elements, each containing a state abbreviation somewhere in the element:

strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

How can I extract the letters, excluding NA? That is,

somefunction(strings)
[1] "AZ"  "CA"  "CT"  "CT"  "ID"  "VA"

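I have previously used regular expressions to remove all non-digits from each element, but never to remove all digits plus only the letters N and A.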

This is what I tried, but it did not work:

 sub(paste(LETTERS[c(2:13,15:26)], collapse = "|"), "", strings, fixed = TRUE)

Answer 1

A simple solution:

gsub("\\d+|NA", "", strings)
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"

Answer 2

This can be done with lookarounds.

 # (?i)(?:(?!na|(?<=n)(?=a))[a-z])+

 (?i)           # Case insensitive modifier (or use as regex flag)
 (?:            # Cluster group
      (?!            # Negative assertion
           na             # Not NA ahead
        |  (?<= n )       # Not N behind,
           (?= a )        # and A ahead (at this location) 
      )              # End Negative assertion
      [a-z]          # Safe, grab this single character
 )+             # End Cluster group, do 1 to many times

This matches only "AZ" "CA" "CT" "CT" "ID" "VA".
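The pattern above is given on its own, so here is a minimal sketch of one way to apply it in R. It assumes the `strings` vector from the question and uses `regexpr()` with `regmatches()`; the variable names `pattern`, `m`, and `res` are just illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# The lookaround pattern from this answer, written as a single R string
pattern <- "(?i)(?:(?!na|(?<=n)(?=a))[a-z])+"

# perl = TRUE is needed because lookarounds are a PCRE feature
# and are not supported by R's default regex engine
m <- regexpr(pattern, strings, perl = TRUE)

# Pull out the first match in each element
res <- regmatches(strings, m)
res
```

Note that `regmatches()` silently drops elements with no match, so on other data the result could be shorter than the input.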

Answer 3

The state datasets are available by default. Take a look:

 ?state

sts <- paste(state.abb,collapse="|")

sub(paste0( "(.+)(", sts, ")(.+)"), "\\2", strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"

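Someone tried to edit this to call dput(state.abb) and then paste the output into a new assignment. Given that state is always available, that is completely unnecessary, hence my rejection. The only value I can see in it would be to encourage people to actually look at the help page, and to show what state.abb looks like: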

?state
dput(state.abb)
#c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", 
# ... snipped the rest.
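As a quick sanity check on the state.abb approach, the sketch below reruns the same `sub()` call and confirms two things: that "NA" is not itself a state abbreviation (which is why this approach sidesteps the NA problem), and that every extracted value is a real abbreviation. The names `na_is_state`, `res`, and `all_valid` are only illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# state.abb ships with base R (datasets package)
na_is_state <- "NA" %in% state.abb

# Same alternation pattern as in the answer
sts <- paste(state.abb, collapse = "|")
res <- sub(paste0("(.+)(", sts, ")(.+)"), "\\2", strings)

# Every extracted value should be a known state abbreviation
all_valid <- all(res %in% state.abb)
```

One thing to keep in mind: the pattern requires at least one character before and after the abbreviation, so an element that starts or ends with the state code would be returned unchanged.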

Answer 4

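If the state abbreviation is always followed by exactly three characters:

# the leading .* consumes the prefix, so only the captured state code remains
strings.stripped <- gsub(".*([A-Z]{2}).{3}$", "\\1", strings)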
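Under the same assumption (the two-letter code is always followed by exactly three characters), a regex is not strictly required. The following is a sketch of a positional alternative using `nchar()` and `substr()`; it is a different technique from the one in this answer, and the names `n` and `res` are illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Length of each element
n <- nchar(strings)

# The state code occupies the 5th and 4th characters from the end,
# i.e. positions n - 4 and n - 3
res <- substr(strings, n - 4, n - 3)
res
```

This only holds while the fixed-width suffix assumption does; if the trailing part can vary in length, the regex-based answers are the safer choice.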

Regex in R: how to extract certain letters from alphanumeric elements in a character vector?
