How can I split a column into two in R so that all the uppercase words end up in one column?

Asked: 2015-11-11 14:45:39

Tags: r

I have a column like this:

x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')

# [1] WV West Virginia                  FL Florida                        
# [3] CA California                     SC South Carolina              

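How can I separate the abbreviation from the full state name? I want to give the two new columns two different headers. I think the only way to do this is to split off everything that is in capital letters.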

6 answers:

Answer 0 (score: 4)

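With tidyr, we can use separate to expand the column into two while specifying the new names. The argument extra = "merge" limits the output to the given columns, so anything after the first split stays in the last column. The separator defaults to runs of non-alphanumeric characters: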

library(tidyr)
separate(df, x, c("Abb", "State"), extra="merge")
#  Abb          State
#1  WV  West Virginia
#2  FL        Florida
#3  CA     California
#4  SC South Carolina

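Data

x = c('WV West Virginia', 'FL Florida','CA California', 'SC South Carolina')
# separate() expects a data frame, so wrap the vector in one
df <- data.frame(x, stringsAsFactors = FALSE)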
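
Because the default separator matches any run of non-alphanumeric characters, it may help to state the separator explicitly. The following is a minimal sketch, assuming tidyr is installed and that a single space always follows the abbreviation; the name `res` is only illustrative.

```r
library(tidyr)

# Same sample data as in the question, wrapped in a data frame
df <- data.frame(
  x = c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina"),
  stringsAsFactors = FALSE
)

# sep = " " splits on a literal space; extra = "merge" keeps everything
# after the first split together in the last column ("State")
res <- separate(df, x, into = c("Abb", "State"), sep = " ", extra = "merge")
print(res)
```

With extra = "merge", multi-word names such as "West Virginia" stay intact in the second column.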

Answer 1 (score: 3)

Two approaches that need no external packages:

Approach 1: you can use substr together with nchar.

dat <- data.frame(raw=c("WV West Virginia","FL Florida", "CA California","SC South Carolina"),
                  stringsAsFactors=FALSE)


dat$code <- substr(dat$raw,1,2)
dat$state <- substr(dat$raw, 4, nchar(dat$raw))

> dat
                raw code          state
1  WV West Virginia   WV  West Virginia
2        FL Florida   FL        Florida
3     CA California   CA     California
4 SC South Carolina   SC South Carolina
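
The fixed positions 1, 2 and 4 above rely on every abbreviation being exactly two characters long. As a hedged variation, the position of the first space can be located with regexpr instead, so the same substr calls would also cope with codes of other lengths. The variable names `raw` and `pos` are illustrative only.

```r
raw <- c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina")

# Position of the first space in each string (fixed = TRUE: literal match)
pos <- as.integer(regexpr(" ", raw, fixed = TRUE))

# Everything before the first space is the code, everything after is the name
code  <- substr(raw, 1, pos - 1)
state <- substr(raw, pos + 1, nchar(raw))

print(data.frame(raw, code, state, stringsAsFactors = FALSE))
```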

Approach 2: you can use regular expressions to replace parts of the string:

##approach two: regex
dat$code <- sub(" .+","",dat$raw)
dat$state <- sub("^[A-Z]{2} ","",dat$raw)  # anchor at the start so only the leading code is removed
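
The two sub calls can also be folded into a single step with capture groups. The sketch below uses strcapture from the utils package (available in R 3.4.0 and later), assuming each string consists of a two-letter uppercase code, one space, and the name; the column names in `proto` are illustrative.

```r
x <- c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina")

# Each parenthesised group becomes one column of the result;
# proto supplies the column names and types
parts <- strcapture(
  "^([A-Z]{2}) (.+)$", x,
  proto = data.frame(code = character(), state = character(),
                     stringsAsFactors = FALSE)
)
print(parts)
```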

Answer 2 (score: 3)

Use the state.* constants that ship with the base datasets package:

DF = data.frame(raw=c("WV West Virginia","FL Florida","CA California","SC South Carolina"))

DF$state.abbr <- substr(DF$raw, 1, 2)
DF$state.name <- state.name[ match(DF$state.abbr, state.abb) ]

#                 raw state.abbr     state.name
# 1  WV West Virginia         WV  West Virginia
# 2        FL Florida         FL        Florida
# 3     CA California         CA     California
# 4 SC South Carolina         SC South Carolina

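This way, typos or other oddities in the state names of the raw data do not matter, because the name is looked up from the abbreviation.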
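
One consequence of the lookup is worth checking: match returns NA for any abbreviation that is not in state.abb, so unknown codes surface as NA rather than as a wrong name. A small sketch, where "XX" is a made-up code used only for illustration:

```r
# "XX" is a hypothetical, invalid abbreviation
abbr <- c("WV", "FL", "XX")

# match() gives the index in state.abb, or NA when there is no match
looked_up <- state.name[match(abbr, state.abb)]
print(looked_up)

# Flag the rows whose abbreviation could not be resolved
print(abbr[is.na(looked_up)])
```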

Answer 3 (score: 2)

Using the reshape2 package.

    library(reshape2)
    x <- rbind('WV West Virginia','FL Florida','CA California','SC South Carolina')
    colsplit(x," ",c("Code","State"))

Output:

  Code          State
1   WV  West Virginia
2   FL        Florida
3   CA     California
4   SC South Carolina

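Answer 4 (score: 2)

Following @rawr's comment, we can split 'x' at the whitespace that comes after the first two characters, which is expressed by the regex lookbehind ((?<=^.{2})). The output is a list, which we rbind and convert to a data.frame, and then cbind with the original vector 'x'.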

 cbind(x, as.data.frame(do.call(rbind,strsplit(x, '(?<=^.{2})\\s+', perl=TRUE)),
                    stringsAsFactors=FALSE))
 #                x V1             V2
 #1  WV West Virginia WV  West Virginia
 #2        FL Florida FL        Florida
 #3     CA California CA     California
 #4 SC South Carolina SC South Carolina

Alternatively, instead of the regex lookaround, we can use stri_split with n=2 and split at whitespace.

 library(stringi)
 cbind(x,as.data.frame(do.call(rbind,stri_split(x, regex='\\s+', n=2))))
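
The same "split only at the first run of whitespace" idea can also be sketched in base R, without a lookbehind or stringi, by combining regexpr with regmatches(..., invert = TRUE). The names `pieces`, `out`, `abb` and `state` are illustrative, not part of the answer above.

```r
x <- c("WV West Virginia", "FL Florida", "CA California", "SC South Carolina")

# regexpr() finds only the FIRST whitespace run in each string;
# invert = TRUE returns the text on either side of that match
pieces <- regmatches(x, regexpr("\\s+", x), invert = TRUE)

# Bind the two-element pieces into rows and attach the original vector
out <- cbind(x, as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE))
names(out) <- c("x", "abb", "state")
print(out)
```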

Answer 5 (score: 0)

Here is a data.table / gsub approach:

x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')

# The space sits outside the second capture group, so state has no leading blank
data.table::data.table(x)[, 
    abb := gsub("^([A-Z]{2}) (.+)", "\\1", x)][, 
    state := gsub("^([A-Z]{2}) (.+)", "\\2", x)][]

##                    x abb          state
## 1:  WV West Virginia  WV  West Virginia
## 2:        FL Florida  FL        Florida
## 3:     CA California  CA     California
## 4: SC South Carolina  SC South Carolina