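How to clean cities and states (both full names and abbreviations) using R

Time: 2018-06-03 02:48:08

Tags: r replace lookup

I have an uncleaned list of cities and states taken from the "location" field of tweets, for example: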

location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
              "Pennsylvania", "MI", "Detroit,MI")

How can I clean the data to produce a clean list with two columns, one for the city and one for the state?

desired output

2 answers:

Answer 0: (score: 0)

You can do it like this:

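# Split on the comma and any whitespace after it, so " Pennsylvania" loses its leading space
splitted_list <- strsplit(location, ",\\s*")
# Pad single-element entries with a leading NA so every entry is c(city, state)
wide_matrix   <- sapply(splitted_list, function(x) c(rep(NA, length(x) == 1), x))
# Transpose to one row per location and name the two columns
res <- setNames(data.frame(t(wide_matrix), stringsAsFactors = FALSE), c("city", "state"))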
res
#         city                state
# 1       <NA> the Great Lake State
# 2       <NA>                   PA
# 3 Harrisburg         Pennsylvania
# 4       <NA>         Pennsylvania
# 5       <NA>                   MI
# 6    Detroit                   MI
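The split above still leaves the state column as a mix of full names and abbreviations. A minimal sketch of one way to normalize it, assuming the built-in `state.abb` and `state.name` vectors from the datasets package (50 states only, no DC or territories) are an acceptable lookup. `normalize_state` and `state_raw` are hypothetical names introduced here, not part of the answer:

```r
# Hypothetical input: the state values produced by the splitting step
state_raw <- c("the Great Lake State", "PA", "Pennsylvania",
               "Pennsylvania", "MI", "MI")

# Assumed helper: map a full state name or a two-letter abbreviation
# to the abbreviation; anything that matches neither becomes NA
normalize_state <- function(x) {
  x <- trimws(x)
  # Look up abbreviations, ignoring case
  by_abb  <- state.abb[match(toupper(x), state.abb)]
  # Look up full names, ignoring case
  by_name <- state.abb[match(tolower(x), tolower(state.name))]
  # Prefer the abbreviation match, fall back to the full-name match
  ifelse(is.na(by_abb), by_name, by_abb)
}

state_clean <- normalize_state(state_raw)
state_clean
# [1] NA   "PA" "PA" "PA" "MI" "MI"
```

Nicknames such as "the Great Lake State" are not in the built-in vectors, so they come back as NA; covering them would need a hand-built lookup table.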

Answer 1: (score: 0)

Assuming your data (location) is already a column of the data.frame you want to clean, tidyr::separate may be a suitable option.

location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
              "Pennsylvania", "MI", "Detroit,MI")


library(tidyverse)

as.data.frame(location) %>% # I created a data.frame, which is not needed in actual data
  # sep is a regular expression; ",\\s*" also drops the space after the comma
  tidyr::separate(location, c("City", "State"), sep = ",\\s*", fill = "left")

#         City                State
# 1       <NA> the Great Lake State
# 2       <NA>                   PA
# 3 Harrisburg         Pennsylvania
# 4       <NA>         Pennsylvania
# 5       <NA>                   MI
# 6    Detroit                   MI
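To see why `fill = "left"` matters here, the following sketch compares it with `fill = "right"` on a two-row example. The data frame `df` and the names `left` and `right` are made up for illustration:

```r
library(tidyr)

# Hypothetical two-row input: one value without a comma, one with
df <- data.frame(location = c("PA", "Detroit,MI"), stringsAsFactors = FALSE)

# fill = "left": a value without a comma is padded on the left,
# so it ends up in State and City becomes NA
left  <- separate(df, location, c("City", "State"), sep = ",", fill = "left")

# fill = "right": the same value is padded on the right,
# so it ends up in City and State becomes NA
right <- separate(df, location, c("City", "State"), sep = ",", fill = "right")

left
right
```

With this data, a lone "PA" is treated as a state only with `fill = "left"`, which is the assumption the answer makes about location strings that contain no comma.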