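R cut function for strings

Time: 2015-10-22 13:06:44

Tags: r performance

Is there a function in R that is similar to cut but works on strings?

The code I'm working on assigns data about US states to a categorical variable called Region with 4 regions: Northeast, Midwest, South, West. The data frame holding the data stores each state in a variable called "state" in its abbreviated form: "NY" for New York or "MS" for Mississippi, for example. The region variable needs to be added to the data frame, and I'm currently doing it as follows (this is homework, so I want to show that I already have a solution; I'm just looking for a possibly better one):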

create.region <- function(state) {
  northeast <- c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA")
  midwest <- c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")
  south <- c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "TN", "MS", "AR", "LA", "OK", "TX")
  west <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA")
  region <- ifelse(state %in% northeast, "Northeast",
            ifelse(state %in% midwest, "Midwest",
            ifelse(state %in% south, "South",
            ifelse(state %in% west, "West", NA))))
  return(region)
}
birth_data <- within(birth_data, region <- create.region(state))

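I don't know R very well yet, and I'm concerned about the efficiency of this code. I've found in the past that the cut function is a more concise and efficient way to categorize numeric data like this, but it obviously doesn't work on character vectors. Is there a function similar to cut that allows character assignment rules rather than only numeric ones?

3 answers:

Answer 0: (score: 2)

The simplest way is to map the names via a vector.

First, we prepare the map: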

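# Names of the region vectors (northeast, midwest, south, west). They must be
# visible here for get() to find them, e.g. defined at top level rather than
# only inside create.region().
all_states = c('northeast', 'midwest', 'south', 'west')

# Named vector for one region: names are state abbreviations, values are the region name.
states_for_region = function (region) {
    states = get(region)
    setNames(rep(region, length(states)), states)
}

states_map = unlist(lapply(all_states, states_for_region))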

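We could also build states_map manually for each region and then concatenate the results. But the above is less repetitive.

Then we do the actual mapping, which now takes just one line.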

region = states_map[state]

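For efficiency, it is best to prepare the map (states_map) outside the function. Otherwise it is regenerated every time you call the function.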
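As a minimal self-contained sketch of this lookup idea, the map can also be built from a named list instead of via get(). The names `regions`, `region_map` and the toy `birth_data` below are made up for illustration, and the state lists are shortened:

```r
# Hypothetical region definitions, shortened to a few states per region.
regions <- list(
  Northeast = c("CT", "NY", "PA"),
  Midwest   = c("IL", "OH", "SD"),
  South     = c("DC", "MS", "TX"),
  West      = c("AZ", "CA", "WA")
)

# Build the lookup vector once, outside any function:
# values are region names, names are state abbreviations.
region_map <- setNames(rep(names(regions), lengths(regions)),
                       unlist(regions, use.names = FALSE))

# Toy data frame standing in for birth_data; "PR" is deliberately unmapped.
birth_data <- data.frame(state = c("MS", "NY", "CA", "PR"),
                         stringsAsFactors = FALSE)

# Index by name; unname() drops the state names carried along by the lookup.
birth_data$region <- unname(region_map[birth_data$state])
```

Unmapped codes such as "PR" come out as NA. If state is stored as a factor, wrap it in as.character() first, since indexing with a factor uses its integer codes rather than its labels.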

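Answer 1: (score: 2)

Out of the box, R includes the variables state.abb and state.region. The former is a character vector of all state abbreviations, and the latter is a 4-level factor of the same length containing the corresponding regions; so, to get the region of MS, say: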

state.region[state.abb == "MS"]
## [1] South
## Levels: Northeast South North Central West

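If you want a different classification, it is easy to define your own replacement for state.region and then use the code above.

Also, note that state.name exists as well; it has the same length as the two variables above and gives the full state names.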
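The comparison with == above looks up one state at a time. For a whole column, a vectorized sketch with match() might look like this (the `state` vector is made-up sample data):

```r
# state.abb and state.region ship with base R (package 'datasets').
state <- c("MS", "NY", "CA", "DC")

# match() gives the position of each abbreviation in state.abb (NA if absent),
# and that position indexes the parallel state.region factor.
region <- state.region[match(state, state.abb)]

as.character(region)
```

Two caveats: state.abb covers the 50 states only, so "DC" yields NA here, and state.region labels the Midwest as "North Central".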

Answer 2: (score: 0)

You can also use the levels<- function together with a mapping list.

Here is an example:

## Create your mapping....
## Overkill in this example as @Grothendieck has pointed out,
##   but still applicable in a general scenario

myLevs <- list(
  Northeast = c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"), 
  Midwest = c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"), 
  South = c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "TN", "MS", "AR", "LA", "OK", "TX"), 
  West = c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA"))

Now, create a sample vector:

set.seed(1)
x <- sample(state.abb, 10)

Convert the vector to a factor and change its levels. This can be done in two steps (y <- factor(x); levels(y) <- myLevs) or in one step, which looks cryptic:

y <- `levels<-`(factor(x), myLevs)

Here is the output:

x
#  [1] "IN" "ME" "NV" "TX" "GA" "SD" "TN" "NH" "NE" "AZ"
y
#  [1] Midwest   Northeast West      South     South     Midwest   South    
#  [8] Northeast Midwest   West     
# Levels: Northeast Midwest South West
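Because sample() can return different draws across R versions even with the same seed, here is a small deterministic sketch of the two-step form. The shortened mapping `my_levs` and the input vector are made up for illustration:

```r
# Hypothetical, shortened mapping: list names are the new levels,
# list elements are the old levels folded into each of them.
my_levs <- list(
  Northeast = c("ME", "NH"),
  Midwest   = c("IN", "SD"),
  South     = c("TX", "GA"),
  West      = c("NV", "AZ")
)

x <- c("IN", "ME", "NV", "TX", "HI")  # "HI" is not in the mapping

y <- factor(x)
levels(y) <- my_levs  # two-step form of `levels<-`(factor(x), my_levs)

as.character(y)
levels(y)
```

Values that appear in no list element, such as "HI" here, become NA, and the resulting levels follow the order of the list names.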