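My goal is to identify US states written out in a character vector that also contains other text, and to convert those states to their abbreviations, for example "North Carolina" to "NC". This is simple if the vector contains only full state names. However, my vector has other text in arbitrary positions, as in the example "states".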
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
From another post I found this:
state.abb[match(states, state.name)]
But it only converts the standalone Texas:
> state.abb[match(states, state.name)]
[1] NA   NA   NA   NA   "TX" NA
and not the strings containing New Jersey, Alabama, and Iowa.
From "Fast grep with a vectored pattern or match, to return list of all matches" I tried:
sapply(states, grep(pattern = state.name, x = states, value = TRUE))
but got:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'Alabama 02138' of mode 'function' was not found
In addition: Warning message:
In grep(pattern = state.name, x = states, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
This does not work either:
sapply(states, function(x) state.abb[grep(state.name, states)])
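This question did not help: regular expression to convert state names to abbreviations
How can I convert the embedded full names to state abbreviations?
Edit: I want the vector returned with the only change being that the full state names are abbreviated, e.g. "Plano New Jersey" becomes "Plano NJ". Thanks for correcting and/or educating me.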
Answer 0 (score: 3)
Try:
indx <- paste0(".*(", paste(state.name, collapse="|"), ").*")
v1 <- gsub(indx, "\\1", states)
ifelse( v1 %in% state.abb, v1, state.abb[match(v1, state.name)])
#[1] "NJ" "NC" NA "AL" "TX" "IA"
If you only want to replace the states with their abbreviations and leave the other text as it is, you can also do:
indx1 <- paste(state.name, collapse="|")
indx2 <- state.abb[match(v1, state.name)]
mapply(gsub, indx1, indx2, states, USE.NAMES=FALSE)
#[1] "Plano NJ" "NC" "xyz" "AL 02138"
#[5] "TX" "Town IA 99999"
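Note that indx2 holds a single abbreviation per element, so a string that mentions two different states would get the same abbreviation for both. A minimal base R sketch that looks up each match separately, using gregexpr() and the replacement form of regmatches(), might look like this. The helper name abbreviate_all is made up for illustration and is not part of base R.

```r
# Sketch: replace every full state name with its own abbreviation.
# `abbreviate_all` is a hypothetical helper name, not a base R function.
abbreviate_all <- function(x) {
  pattern <- paste(state.name, collapse = "|")
  m <- gregexpr(pattern, x)            # positions of all matches in each string
  regmatches(x, m) <- lapply(
    regmatches(x, m),                  # the matched state names, per string
    function(found) state.abb[match(found, state.name)]
  )
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
abbreviate_all(states)
abbreviate_all("Texas to Iowa")
```

Strings without any match are returned unchanged, because their list of replacements is empty.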
Answer 1 (score: 3)
Here is another approach:
library(qdap)
mgsub(state.name, state.abb, states)
## [1] "Plano NJ" "NC" "xyz" "AL 02138"
## [5] "TX" "Town IA 99999"
If you are not sure whether the states will be capitalized, you may want to use:
mgsub(state.name, state.abb, states, ignore.case=TRUE, fixed=FALSE)
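Answer 2 (score: 1)
It is not clear from the question what the expected result is, but here we assume you want to keep the text in the input and only replace the full state names with their abbreviations.
Create a list st whose names are the full state names and whose values are the abbreviations. Then use paste(..., collapse = "|") to build a regular expression that matches any state, and use gsubfn from the gsubfn package to perform the replacement.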
library(gsubfn)
st <- as.list(setNames(state.abb, state.name))
gsubfn(paste(state.name, collapse = "|"), st, states)
giving:
[1] "Plano NJ" "NC" "xyz" "AL 02138"
[5] "TX" "Town IA 99999"
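One thing to watch with a plain alternation is that it also matches inside longer words, for example "Indiana" inside "Indianapolis". The sketch below uses only base R to show the lookup and one possible refinement, wrapping the alternation in word boundaries with a Perl-style regex. The names plain and bounded are made up for illustration, and the word-boundary pattern is an assumption, not part of the code above.

```r
# Named lookup: full state name -> abbreviation
st <- setNames(state.abb, state.name)

# Plain alternation, built the same way as the pattern above
plain <- paste(names(st), collapse = "|")

# Assumed refinement: only match whole words (needs perl = TRUE for (?: ... ))
bounded <- paste0("\\b(?:", plain, ")\\b")

grepl(plain, "Indianapolis")                  # TRUE: "Indiana" matches inside the word
grepl(bounded, "Indianapolis", perl = TRUE)   # FALSE: no whole-word state name
grepl(bounded, "Gary Indiana", perl = TRUE)   # TRUE
st[["Indiana"]]                               # "IN"
```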
Answer 3 (score: 1)
If you do not want to use additional packages, you can use mapply to apply gsub to every state.name and state.abb pair, for example:
mapply(gsub,state.name,state.abb,"ALABAMA 123",ignore.case=TRUE,USE.NAMES=FALSE)
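The result is a character vector with one entry per state; only the entries whose state name matched contain a replacement, for example: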
[1] "AL 123" "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" "ALABAMA 123"
[6] ...
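Because a successful replacement shortens the string, taking the shortest text from this vector gives the desired result. So we sort the vector by text length and take the first element.
The full code: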
replaceState <- function(x) {
  v <- mapply(gsub, state.name, state.abb, x, ignore.case=TRUE, USE.NAMES=FALSE)
  v[order(nchar(v))][1]
}
sapply(states, replaceState, USE.NAMES=FALSE)
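Unfortunately, this approach replaces only a single state name (the longest one). To replace several different states we need to iterate, for example: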
replaceState <- function(x) {
  v <- mapply(gsub, state.name, state.abb, x, ignore.case=TRUE, USE.NAMES=FALSE)
  v[order(nchar(v))][1]
}
replaceStates <- function(x) {
  newX <- replaceState(x)
  # if they are different a state has been replaced,
  # we try again to replace all states.
  if(newX != x){
    replaceStates(newX)
  } else {
    newX
  }
}
# Note the 'replaceStates'
sapply(states, replaceStates, USE.NAMES=FALSE)
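The same idea can be written without recursion. The following is only a sketch under the same assumptions as the code above; replace_one and replace_all are hypothetical names, and MoreArgs is used to pass the constant arguments to gsub.

```r
# Sketch of a loop-based variant; helper names are made up for illustration.
replace_one <- function(x) {
  # Try every (state name, abbreviation) pair on the single string x
  v <- mapply(gsub, state.name, state.abb,
              MoreArgs = list(x = x, ignore.case = TRUE),
              USE.NAMES = FALSE)
  # A successful replacement shortens the string, so keep the shortest result
  v[which.min(nchar(v))]
}

replace_all <- function(x) {
  repeat {
    new_x <- replace_one(x)
    if (identical(new_x, x)) return(x)  # nothing left to replace
    x <- new_x
  }
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
vapply(states, replace_all, character(1), USE.NAMES = FALSE)
replace_all("texas to IOWA")
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector.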
Answer 4 (score: 0)
Try:
for(r in 1:nrow(states.list)) {
  states = gsub(states.list[r,1], states.list[r,2], states)
}
states
[1] "Plano NJ" "NC" "xyz" "AL 02138" "TX" "Town IA 99999"
Data:
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
states.list = structure(list(state.name = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("Alabama",
"Iowa", "Minnesota", "New Jersey", "Texas"), class = "factor"),
state.abb = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("AL",
"IA", "MN", "NJ", "TX"), class = "factor")), .Names = c("state.name",
"state.abb"), class = "data.frame", row.names = c(NA, -5L))
states.list
  state.name state.abb
1 New Jersey        NJ
2    Alabama        AL
3      Texas        TX
4       Iowa        IA
5  Minnesota        MN
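The table above covers only five states. As a sketch, the same loop can run over all fifty by building the lookup from R's built-in state.name and state.abb. The names lookup and abbreviate_loop are made up for illustration. Because the replacements happen one after another, the rows are sorted so that longer names come first; otherwise "Virginia" would be replaced inside "West Virginia" before the full name is tried.

```r
# Sketch: loop over a lookup table built from the built-in state data.
# `lookup` and `abbreviate_loop` are hypothetical names.
lookup <- data.frame(state.name, state.abb, stringsAsFactors = FALSE)

# Longest names first, so "West Virginia" is handled before "Virginia"
lookup <- lookup[order(nchar(lookup$state.name), decreasing = TRUE), ]

abbreviate_loop <- function(x) {
  for (r in seq_len(nrow(lookup))) {
    # fixed = TRUE treats the state name as a literal string, not a regex
    x <- gsub(lookup$state.name[r], lookup$state.abb[r], x, fixed = TRUE)
  }
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
abbreviate_loop(states)
abbreviate_loop("West Virginia 25301")
```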