Split a data frame column on a comma

Time: 2019-04-28 02:57:00

Tags: r dplyr tidyr

I have a data frame named "final_proj_data" with the following structure:

ID          County              Population     Year  
<dbl>       <chr>               <dbl>          <dbl>    
1003    Baldwin County, Alabama 169162         2006     
1015    Calhoun County, Alabama 112903         2006     
1043    Cullman County, Alabama 80187          2006     
1049    DeKalb County, Alabama  68014          2006 

I am trying to split the "County" column into two separate columns, "County" and "State", and remove the comma.

I have tried many permutations of the separate() function, but I keep getting this error:

  

Error: var must evaluate to a single number or a column name, not a character vector

I have tried:

final_proj_data %>%
  separate(final_proj_data$County, c("State", "County"), sep = ",", remove = TRUE)

final_proj_data %>%
  separate(data = final_proj_data, col = County,
           into = c("State", "County"), sep = ",")

I am not sure what I am doing wrong, or why "col =" keeps throwing this error. Any help would be greatly appreciated!

4 answers:

Answer 0: (score: 3)

Using dplyr and base R:

library(dplyr)
 final_proj_data %>% 
 mutate(State=unlist(lapply(strsplit(County,", "),function(x) x[2])),
       County=gsub(",.*","",County))
    ID         County Population Year   State
1 1003 Baldwin County     169162 2006 Alabama
2 1015 Calhoun County     112903 2006 Alabama
3 1043 Cullman County      80187 2006 Alabama
4 1049  DeKalb County      68014 2006 Alabama

Original:

Using dplyr and tidyr (I just saw that @Ronak Shah posted the same suggestion in a comment above):

library(dplyr)
library(tidyr)
final_proj_data %>% 
   separate(County,c("County","State"),sep=",")
    ID         County    State Population Year
1 1003 Baldwin County  Alabama     169162 2006
2 1015 Calhoun County  Alabama     112903 2006
3 1043 Cullman County  Alabama      80187 2006
4 1049  DeKalb County  Alabama      68014 2006
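Two details of the tidyr call above are worth spelling out. First, County is passed as a bare column name; passing final_proj_data$County instead hands separate() a character vector, which appears to be what triggers the error in the question. Second, splitting on "," alone keeps the space that follows the comma at the start of each State value. The sketch below uses a two-row sample in the shape of the question's data (the names with_space and clean are only illustrative) and compares sep = "," with sep = ", ".

```r
library(tidyr)

# Small sample in the shape of the question's data
final_proj_data <- data.frame(
  ID = c(1003, 1015),
  County = c("Baldwin County, Alabama", "Calhoun County, Alabama"),
  Population = c(169162, 112903),
  Year = c(2006, 2006),
  stringsAsFactors = FALSE
)

# Splitting on "," alone keeps the space that follows the comma
with_space <- separate(final_proj_data, County,
                       into = c("County", "State"), sep = ",")

# Splitting on ", " (comma plus space) drops it
clean <- separate(final_proj_data, County,
                  into = c("County", "State"), sep = ", ")

print(with_space$State)
print(clean$State)
```

If the spacing after the comma is not consistent in the real data, trimming the new column afterwards with trimws() is another option.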

Answer 1: (score: 2)

We can try sub here as a base R option:

County <- sub(",.*$", "", final_proj_data$County)
State <- sub("^.*,\\s*", "", final_proj_data$County)
final_proj_data$County <- County
final_proj_data$State <- State
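Since this answer shows no output, here is a minimal sketch of what the two patterns do, run on a hypothetical sample vector (county_raw is not part of the original answer). The first pattern removes the first comma and everything after it; the second removes everything up to the last comma plus any whitespace that follows.

```r
# Hypothetical sample values in the same "County, State" format
county_raw <- c("Baldwin County, Alabama", "DeKalb County, Alabama")

# Drop the first comma and everything after it, leaving the county name
county <- sub(",.*$", "", county_raw)

# Drop everything up to the last comma and any whitespace after it,
# leaving the state name
state <- sub("^.*,\\s*", "", county_raw)

print(county)
print(state)
```

Both patterns assume each value contains exactly one comma; a value with no comma would be returned unchanged by both calls.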

Answer 2: (score: 2)

We can use read.csv in base R:

final_proj_data[c("County", "State")] <- read.csv(text = final_proj_data$County, 
              header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
final_proj_data
#    ID         County Population Year   State
#1 1003 Baldwin County     169162 2006 Alabama
#2 1015 Calhoun County     112903 2006 Alabama
#3 1043 Cullman County      80187 2006 Alabama
#4 1049  DeKalb County      68014 2006 Alabama

Data

final_proj_data <- structure(list(ID = c(1003L, 1015L, 1043L, 1049L), 
   County = c("Baldwin County, Alabama", 
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c(169162L, 112903L, 80187L, 68014L), Year = c(2006L, 
2006L, 2006L, 2006L)), class = "data.frame", row.names = c(NA, 
-4L))

Answer 3: (score: 1)

We can use strsplit in base R:

cbind(d, `colnames<-`(do.call(rbind, strsplit(d$County, ", ")), c("County", "State")))[-2]
#     ID Population Year         County   State
# 1 1003     169162 2006 Baldwin County Alabama
# 2 1015     112903 2006 Calhoun County Alabama
# 3 1043      80187 2006 Cullman County Alabama
# 4 1049      68014 2006  DeKalb County Alabama

Note: if County is a factor column, use strsplit(as.character(d$County), ", ")
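The note about factor columns can be sketched as follows. The data frame d_factor is a hypothetical sample, not the answer's data. strsplit() only accepts character input, so the factor has to be converted first.

```r
# Hypothetical sample: County stored as a factor
d_factor <- data.frame(
  County = factor(c("Baldwin County, Alabama", "DeKalb County, Alabama"))
)

# strsplit() on a factor raises an error; capture its message instead of stopping
factor_error <- tryCatch(
  strsplit(d_factor$County, ", "),
  error = function(e) conditionMessage(e)
)
print(factor_error)

# Converting to character first makes the split work
parts <- strsplit(as.character(d_factor$County), ", ")

# Bind the pieces row by row into a two-column character matrix
split_cols <- do.call(rbind, parts)
colnames(split_cols) <- c("County", "State")
print(split_cols)
```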

Data

d <- structure(list(ID = c("1003", "1015", "1043", "1049"), County = c("Baldwin County, Alabama", 
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c("169162", "112903", "80187", "68014"), Year = c("2006", 
"2006", "2006", "2006")), row.names = c(NA, -4L), class = "data.frame")