I have a "messy" dataset that I am struggling to convert to a tidy format. Here is what the data looks like:
make <- c("Honda", "Jeep", "Nissan", "Ford")
model <- c("Civic", "Wrangler", "Altima", "Focus")
year <- c(1996, 2000, 2005, 1988)
color <- c("red;green;blue", "red;blue", "purple;red;green;black", "yellow;white;blue")
car.df <- data.frame(make, model, year, color)
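What I want to do is convert the data to tidy/long format by splitting the "color" field so that each make/model/year/color combination gets its own row. The output would look like this (shown for Honda and Jeep only):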
make.new <- c("Honda", "Honda", "Honda", "Jeep", "Jeep")
model.new <- c("Civic", "Civic", "Civic", "Wrangler", "Wrangler")
year.new <- c(1996, 1996, 1996, 2000, 2000)
color.new <- c("red", "green", "blue", "red", "blue")
car.df.new <- data.frame(make.new, model.new, year.new, color.new)
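Any suggestions on how to do this? In the real dataset there can be many different colors, so once the color field is split into separate columns there may be many columns to tidy (and, as a result, a different number of rows for each make/model/year in the tidy dataset).
Thanks for any helpful suggestions!
Steve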
Answer 0 (score: 2)
cSplit from splitstackshape can do this in a compact way. Specify splitCols ("color"), sep (";"), and direction ("long"), and it gives the expected output.
library(splitstackshape)
cSplit(car.df, "color", ";", "long")
# make model year color
# 1: Honda Civic 1996 red
# 2: Honda Civic 1996 green
# 3: Honda Civic 1996 blue
# 4: Jeep Wrangler 2000 red
# 5: Jeep Wrangler 2000 blue
# 6: Nissan Altima 2005 purple
# 7: Nissan Altima 2005 red
# 8: Nissan Altima 2005 green
# 9: Nissan Altima 2005 black
#10: Ford Focus 1988 yellow
#11: Ford Focus 1988 white
#12: Ford Focus 1988 blue
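For intuition about what this long-format split involves, here is a minimal base R sketch of the same reshaping. It is not how cSplit is implemented; it only assumes the example data from the question, and the names parts, idx, and car.long are made up for illustration.

```r
# Rebuild the question's example data so the sketch is self-contained.
car.df <- data.frame(
  make  = c("Honda", "Jeep", "Nissan", "Ford"),
  model = c("Civic", "Wrangler", "Altima", "Focus"),
  year  = c(1996, 2000, 2005, 1988),
  color = c("red;green;blue", "red;blue",
            "purple;red;green;black", "yellow;white;blue"),
  stringsAsFactors = FALSE
)

# Split each color string on the literal ";" into a list of character vectors.
parts <- strsplit(car.df$color, ";", fixed = TRUE)

# Repeat each row index once per color found in that row.
idx <- rep(seq_len(nrow(car.df)), lengths(parts))

# Combine the repeated identifying columns with the flattened colors.
car.long <- data.frame(
  car.df[idx, c("make", "model", "year")],
  color = unlist(parts),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Because the number of repeats comes from lengths(parts), rows with different numbers of colors are handled without padding.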
If we need a dplyr/tidyr solution:
library(dplyr)
library(tidyr)
library(stringr)
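# The row with the most colors determines how many temporary columns are needed.
n_colors <- max(str_count(car.df$color, ";")) + 1
separate(car.df, color, into = paste0("color", seq_len(n_colors)),
         sep = ";", fill = "right") %>%
  gather(Var, color, -make, -model, -year, na.rm = TRUE) %>%  # drop the NA padding
  select(-Var)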
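If the installed tidyr provides separate_rows() (an assumption about your version; it is not part of the original answer), the wide intermediate step can probably be skipped altogether. A minimal sketch using the question's data, with car.long as an illustrative name:

```r
library(tidyr)

# Rebuild the question's example data so the sketch is self-contained.
car.df <- data.frame(
  make  = c("Honda", "Jeep", "Nissan", "Ford"),
  model = c("Civic", "Wrangler", "Altima", "Focus"),
  year  = c(1996, 2000, 2005, 1988),
  color = c("red;green;blue", "red;blue",
            "purple;red;green;black", "yellow;white;blue"),
  stringsAsFactors = FALSE
)

# Split the delimited color column directly into one row per color.
car.long <- separate_rows(car.df, color, sep = ";")
```

This keeps the rows in their original order and needs no NA padding, since no fixed number of color columns is ever created.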