I have an uncleaned list of cities and states taken from the "location" field of tweets, for example:
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
How can I clean the data to produce a tidy list with two columns, one for city and one for state?
Answer 0 (score: 0)
You can do it like this:
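# Split on the comma and drop any whitespace that follows it
splitted_list <- strsplit(location, ",\\s*")
# Left-pad single-element entries with NA so every entry is (city, state)
wide_matrix <- sapply(splitted_list, function(x) c(rep(NA, length(x) == 1), x))
res <- setNames(data.frame(t(wide_matrix), stringsAsFactors = FALSE), c("city", "state"))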
res
# city state
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
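The padding step is the least obvious part of this answer: `rep(NA, length(x) == 1)` works because R coerces the logical `TRUE`/`FALSE` to 1 or 0 repetitions. A minimal sketch of the same idea written out explicitly, where `pad_left` is a hypothetical helper name that does not appear in the answer:

```r
# Hypothetical helper (not from the answer above): left-pad a split entry
# with NA so that it always has two elements, (city, state).
pad_left <- function(x) {
  if (length(x) == 1) c(NA_character_, x) else x
}

pad_left("PA")                 # NA   "PA"
pad_left(c("Detroit", "MI"))   # "Detroit" "MI"

# The one-liner from the answer gives the same result for a lone state
c(rep(NA, TRUE), "PA")         # NA   "PA"
```

Either form assumes each entry contains at most one comma; entries with more parts would need separate handling.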
Answer 1 (score: 0)
Assuming your data (location) is already a column of the data.frame you want to clean, tidyr::separate may be a suitable option.
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
library(tidyverse)
as.data.frame(location) %>% # data.frame created only for this example; real data would already be one
  tidyr::separate(location, c("City", "State"), sep = ",\\s*", fill = "left")
# City State
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
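Both approaches still leave the state column in mixed form ("PA" next to "Pennsylvania"). One possible follow-up, sketched here under the assumption that only US states matter and that two-letter codes are the desired form, is to map full names to abbreviations with the `state.name` and `state.abb` vectors that ship with base R. The function name `normalize_state` is hypothetical:

```r
# Assumption: full US state names should become two-letter codes.
# state.name and state.abb are built-in vectors from the datasets package.
normalize_state <- function(state) {
  state <- trimws(state)
  idx <- match(tolower(state), tolower(state.name))
  # Replace recognised full names; leave everything else untouched
  ifelse(is.na(idx), state, state.abb[idx])
}

normalize_state(c("the Great Lake State", "PA", " Pennsylvania", "MI"))
# "the Great Lake State" "PA" "PA" "MI"
```

Nicknames such as "the Great Lake State" are not covered by these vectors, so they pass through unchanged and would need a hand-built lookup table.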