I have the following table in R, with one row per salary record:
> df
  employee employment start_date  end_date salary
1      Ian          1  28Jul2010 28Jul2011  20000
2     Rose          1  28Jul2011 28Jul2012  30000
3     Rose          2  28Jul2012 28Jul2013  31000
I would like to convert it to the following structure, with one row per employee:
> df2
  employee start_date_employement_1 end_date_employment_1 salary_employement_1 start_date_employement_2 end_date_employment_2 salary_employement_2
1      Ian                28Jul2010             28Jul2011                20000                     <NA>                  <NA>                   NA
2     Rose                28Jul2011             28Jul2012                30000                28Jul2012             28Jul2013                31000
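Unfortunately, I cannot see how to do this and would appreciate some help.
Note: the R code that creates the tables above is at the end of this post.
This is a data-reshaping problem, so I think the reshape / reshape2 packages are the way forward.
I can run basic melt and cast examples, but I cannot see how to apply them to my specific problem. When I try, my salary values disappear and I am not sure why (it seems to treat salary as a factor rather than a number?):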
library(reshape2)
library(dplyr)
> melt(df, id.vars = c("employee", "employment")) %>%
arrange(employee, employment)
  employee employment   variable     value
1      Ian          1 start_date 28Jul2010
2      Ian          1   end_date 28Jul2011
3      Ian          1     salary      <NA>
4     Rose          1 start_date 28Jul2011
5     Rose          1   end_date 28Jul2012
6     Rose          1     salary      <NA>
7     Rose          2 start_date 28Jul2012
8     Rose          2   end_date 28Jul2013
9     Rose          2     salary      <NA>
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(20000L, 30000L, 31000L)) :
invalid factor level, NA generated
However, if the above had worked, I would then have done this:
melt(df, id.vars = c("employee", "employment")) %>%
arrange(employee, employment) %>%
mutate(variable = paste(variable, employment, sep="_")) %>%
select(employee, variable, value) %>%
cast()
  employee end_date_1 end_date_2 salary_1 salary_2 start_date_1 start_date_2
1      Ian  28Jul2011       <NA>     <NA>     <NA>    28Jul2010         <NA>
2     Rose  28Jul2012  28Jul2013     <NA>     <NA>    28Jul2011    28Jul2012
This is almost what I want, apart from the NAs and the column order.
df <-
structure(list(employee = c("Ian", "Rose", "Rose"),
employment = c(1L, 1L, 2L),
start_date = c("28Jul2010", "28Jul2011", "28Jul2012"),
end_date = c("28Jul2011", "28Jul2012", "28Jul2013"),
salary = c(20000.00, 30000.00, 31000.00)),
.Names = c("employee", "employment", "start_date", "end_date", "salary"),
sorted = c("employee", "employment"), class = c("data.frame"), row.names = c(NA, -3L))
df2 <-
structure(list(employee = c("Ian", "Rose"), start_date_employement_1 = c("28Jul2010", "28Jul2011"),
end_date_employment_1 = c("28Jul2011", "28Jul2012"),
salary_employement_1 = c(20000L, 30000L),
start_date_employement_2 = c(NA, "28Jul2012"),
end_date_employment_2 = c(NA, "28Jul2013"),
salary_employement_2 = c(NA, 31000L)),
.Names = c("employee", "start_date_employement_1", "end_date_employment_1", "salary_employement_1", "start_date_employement_2", "end_date_employment_2", "salary_employement_2"),
class = "data.frame", row.names = c(NA, -2L))
Answer 0: (score: 3)
To reshape a data frame from long to wide you want dcast, but reshape2::dcast does not seem to support multiple value.var columns. You can use reshape from base R instead:
reshape(df, direction = "wide", idvar = "employee", timevar = "employment")
#  employee start_date.1 end_date.1 salary.1 start_date.2 end_date.2 salary.2
#1      Ian    28Jul2010  28Jul2011    20000         <NA>       <NA>       NA
#2     Rose    28Jul2011  28Jul2012    30000    28Jul2012  28Jul2013    31000
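The target layout in the question uses underscores rather than dots in the column names. As a minimal sketch, assuming data shaped like the question's `df` (rebuilt here so the snippet runs on its own), the `sep` argument of `reshape` sets the separator placed between each variable name and the `timevar` value. The name `wide` is only illustrative.

```r
# Rebuild the question's data so the example is self-contained.
df <- data.frame(
  employee   = c("Ian", "Rose", "Rose"),
  employment = c(1L, 1L, 2L),
  start_date = c("28Jul2010", "28Jul2011", "28Jul2012"),
  end_date   = c("28Jul2011", "28Jul2012", "28Jul2013"),
  salary     = c(20000, 30000, 31000),
  stringsAsFactors = FALSE
)

# sep = "_" makes reshape build names such as salary_1 instead of salary.1.
wide <- reshape(df, direction = "wide", idvar = "employee",
                timevar = "employment", sep = "_")

names(wide)
```

The columns come out grouped by employment (the three `_1` columns, then the three `_2` columns), which is the ordering asked for. Getting the longer `..._employment_1` style names would still need a separate renaming step.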
Or use data.table::dcast:
library(data.table)
dcast(setDT(df), employee ~ employment, value.var = c("start_date", "end_date", "salary"))
#   employee start_date_1 start_date_2 end_date_1 end_date_2 salary_1 salary_2
#1:      Ian    28Jul2010           NA  28Jul2011         NA    20000       NA
#2:     Rose    28Jul2011    28Jul2012  28Jul2012  28Jul2013    30000    31000
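Note that data.table::dcast groups the output columns by variable rather than by employment. If the per-employment ordering from the question matters, one possible follow-up is to reorder the columns with `setcolorder`. This is a sketch under the assumption that the wide names follow the `variable_employment` pattern shown above; `vars`, `jobs`, and `new_order` are illustrative names, and `as.data.table` is used instead of `setDT` so the original `df` is left unmodified.

```r
library(data.table)

# Rebuild the question's data so the example is self-contained.
df <- data.frame(
  employee   = c("Ian", "Rose", "Rose"),
  employment = c(1L, 1L, 2L),
  start_date = c("28Jul2010", "28Jul2011", "28Jul2012"),
  end_date   = c("28Jul2011", "28Jul2012", "28Jul2013"),
  salary     = c(20000, 30000, 31000),
  stringsAsFactors = FALSE
)

vars <- c("start_date", "end_date", "salary")
wide <- dcast(as.data.table(df), employee ~ employment, value.var = vars)

# outer() builds a variables-by-employments matrix of names; as.vector()
# reads it column by column, so all "_1" names come before the "_2" names.
jobs <- sort(unique(df$employment))
new_order <- c("employee", as.vector(outer(vars, jobs, paste, sep = "_")))

# setcolorder() reorders the columns of the data.table in place.
setcolorder(wide, new_order)
names(wide)
```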