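Converting long state names embedded in other text to two-letter state abbreviations

Time: 2014-08-30 12:45:58

Tags: regex r grep

My goal is to identify US states that are written out in a character vector containing other text, and to convert those states to their abbreviated form. For example, "North Carolina" to "NC". If the vector held only long-form state names, this would be easy. However, my vector has other text in random positions, as in the example `states`.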

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

From another post I found this:

state.abb[match(states, state.name)]

But it only converts the stand-alone Texas

> state.abb[match(states, state.name)]
[1] NA   NA   NA   NA   "TX" NA

and not the strings that contain New Jersey, Alabama, and Iowa.
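The reason is that match() compares whole strings for equality, so an element matches only when it is exactly a state name. The snippet below is a minimal sketch of that difference, not part of the original post; the variable names `exact`, `any_state`, and `embedded` are made up for illustration.

```r
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

# match() tests whole-string equality, so only the bare "Texas" is found
exact <- match(states, state.name)

# One alternation pattern ("Alabama|Alaska|...") matches a name anywhere in the string
any_state <- paste(state.name, collapse = "|")
embedded <- grepl(any_state, states)

print(is.na(exact))  # TRUE where match() found nothing
print(embedded)      # TRUE where a full state name occurs somewhere in the element
```

The alternation pattern finds the New Jersey, Alabama, and Iowa elements that match() misses, which is the starting point for the regex-based answers.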

Following the question "Fast grep with a vectored pattern or match, to return list of all matches", I tried:

sapply(states, grep(pattern = state.name, x = states, value = TRUE))

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'Alabama 02138' of mode 'function' was not found
In addition: Warning message:
In grep(pattern = state.name, x = states, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

This doesn't work either:

sapply(states, function(x) state.abb[grep(state.name, states)])
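Both attempts hit the same limit: grep() uses only the first element of `pattern`, so passing all of state.name tests for "Alabama" alone. In the first attempt the grep() call is also evaluated before sapply() runs, so its result is passed where a function is expected, which is what the error message reports. A possible way to loop over the patterns instead is sketched here; `hits`, `first_hit`, and `abbr` are illustrative names, and taking the first matching name is a simplifying assumption.

```r
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

# grepl() accepts one pattern at a time, so loop over the 50 state names.
# Result: a logical matrix with one row per element of states
# and one column per state name.
hits <- sapply(state.name, grepl, x = states, fixed = TRUE)

# For each row, take the position of the first matching state name (NA if none)
first_hit <- apply(hits, 1, function(row) which(row)[1])

abbr <- state.abb[first_hit]
print(abbr)
```

Because the first match in alphabetical order wins, a string containing "West Virginia" would be labelled with the abbreviation for "Virginia", so this sketch only illustrates the vectorisation and is not a complete solution.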

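This question did not help: "regular expression to convert state names to abbreviations"

How can I convert the embedded long names to state abbreviations?

Edit: I want to get the vector back with the only change being that the long state names have been abbreviated, e.g. "Plano New Jersey" becomes "Plano NJ".

Thank you for correcting and/or educating me.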

5 answers:

Answer 0: (score: 3)

Try:

indx <- paste0(".*(", paste(state.name, collapse="|"), ").*")
v1 <- gsub(indx, "\\1", states)
ifelse( v1 %in% state.abb, v1, state.abb[match(v1, state.name)])
#[1] "NJ" "NC" NA   "AL" "TX" "IA"

If you only want to replace the states with their abbreviations and leave the other text untouched, you could also do:

indx1 <- paste(state.name, collapse="|")   
indx2 <- state.abb[match(v1, state.name)]

mapply(gsub, indx1, indx2, states, USE.NAMES=FALSE)
#[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
#[5] "TX"            "Town IA 99999"

Answer 1: (score: 3)

Here is another approach:

library(qdap)
mgsub(state.name, state.abb, states)

## [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
## [5] "TX"            "Town IA 99999"

If you are not sure whether the states will be capitalized, you may want to use:

mgsub(state.name, state.abb, states, ignore.case=TRUE, fixed=FALSE)

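Answer 2: (score: 1)

It is not clear from the question what the expected result is, but here we assume that you want to keep the text in the input and just replace the full state names with their abbreviations.

Create a list st whose names are the full state names and whose values are the abbreviations. Then use paste(..., collapse = "|") to build a regular expression that matches any state, and use gsubfn from the gsubfn package to perform the substitution.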

library(gsubfn)
st <- as.list(setNames(state.abb, state.name))
gsubfn(paste(state.name, collapse = "|"), st, states)

giving:

[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
[5] "TX"            "Town IA 99999"
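The same lookup idea can also be sketched in base R alone, using gregexpr() to locate the names and the replacement form of regmatches() to swap them. This is an illustrative variant rather than the answerer's code: the function name `abbreviate_states` and the `\\b` word boundaries are assumptions added here.

```r
abbreviate_states <- function(x) {
  # Named lookup vector: full state name -> abbreviation
  lookup <- setNames(state.abb, state.name)

  # Match any full state name as a whole word
  pattern <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)

  # Replace each matched name with its abbreviation; elements without
  # a match yield character(0) and are left unchanged
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) unname(lookup[hit]))
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
print(abbreviate_states(states))
```

Since every match is looked up by name, several different states in one string are each replaced in a single pass.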

Answer 3: (score: 1)

If you don't want to use additional packages, you can use the mapply function to apply gsub to every state.name, state.abb pair, for example:

mapply(gsub,state.name,state.abb,"ALABAMA 123",ignore.case=TRUE,USE.NAMES=FALSE)

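The result is a character vector with one element per state name, in which the element for a matching state contains the replacement, e.g.: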

 [1] "AL 123"      "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" 
 [6] ...

By taking the shortest text from this vector you get the desired result. So we sort the vector by the length of the text and take the first element.

Full code:

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

sapply(states, replaceState, USE.NAMES=FALSE)

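Unfortunately, this approach replaces only a single state name (the longest one). To replace several different states we need to iterate, for example: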

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

replaceStates <- function(x) {
     newX = replaceState(x)

     # if they are different a state has been replaced, 
     # we try again to replace all states.
     if(newX != x){ 
          replaceStates(newX)
     } else {
          newX
     }
}

# Note the 'replaceStates'
sapply(states, replaceStates, USE.NAMES=FALSE)
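As a possible alternative to the recursion, the 50 substitutions can be folded over the whole vector with Reduce(), processing longer names first so that "West Virginia" is handled before "Virginia" and "Arkansas" before "Kansas". This is a sketch under those assumptions; `replace_all_states` and `ord` are hypothetical names and not part of the answer above.

```r
replace_all_states <- function(x, ignore.case = TRUE) {
  # Process longer names first so that a name contained in another
  # (e.g. "Virginia" in "West Virginia") cannot pre-empt the longer one
  ord <- order(nchar(state.name), decreasing = TRUE)

  # Fold over the state indices; each step applies one gsub() to the whole vector
  Reduce(
    function(acc, i) gsub(state.name[i], state.abb[i], acc, ignore.case = ignore.case),
    ord,
    x
  )
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
print(replace_all_states(states))
```

Each gsub() call is vectorised over the input, so no sapply() over the elements is needed, and every state in a string is replaced in one sweep.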

Answer 4: (score: 0)

Try:

# states.list is the lookup table defined under "Data" below
for(r in seq_len(nrow(states.list))) {
    states = gsub(states.list[r,1], states.list[r,2], states)
}

states
[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"      "TX"            "Town IA 99999"

Data:

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

states.list = structure(list(state.name = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("Alabama", 
"Iowa", "Minnesota", "New Jersey", "Texas"), class = "factor"), 
    state.abb = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("AL", 
    "IA", "MN", "NJ", "TX"), class = "factor")), .Names = c("state.name", 
"state.abb"), class = "data.frame", row.names = c(NA, -5L))

states.list
  state.name state.abb
1 New Jersey        NJ
2    Alabama        AL
3      Texas        TX
4       Iowa        IA
5  Minnesota        MN