Restructuring a data.frame into long format turns my numbers into NAs?

Time: 2017-06-07 20:37:18

Tags: r dataframe reshape2 melt

Question

I have the following table in R, with one row per salary:

> df
  employee employment start_date  end_date salary
1      Ian          1  28Jul2010 28Jul2011  20000
2     Rose          1  28Jul2011 28Jul2012  30000
3     Rose          2  28Jul2012 28Jul2013  31000

I would like to convert it to the following structure, with one row per employee:

> df2
  employee start_date_employement_1 end_date_employment_1 salary_employement_1 start_date_employement_2 end_date_employment_2 salary_employement_2
1      Ian                28Jul2010             28Jul2011                20000                     <NA>                  <NA>                   NA
2     Rose                28Jul2011             28Jul2012                30000                28Jul2012             28Jul2013                31000

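Unfortunately, I cannot see how to do this and would appreciate some help.

Note: the R code that creates the tables above is at the end of this post.

My failed approach

This is a data-restructuring problem, so I assumed the reshape / reshape2 packages were the way forward.

I can run the basic melt and cast examples, but I cannot see how to apply them to my specific problem. When I try, my salary values disappear and I am not sure why (it seems to treat my salary as a factor rather than a number?):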

library(reshape2)
library(dplyr)
> melt(df, id.vars = c("employee", "employment")) %>% 
    arrange(employee, employment)
  employee employment   variable     value
1      Ian          1 start_date 28Jul2010
2      Ian          1   end_date 28Jul2011
3      Ian          1     salary      <NA>
4     Rose          1 start_date 28Jul2011
5     Rose          1   end_date 28Jul2012
6     Rose          1     salary      <NA>
7     Rose          2 start_date 28Jul2012
8     Rose          2   end_date 28Jul2013
9     Rose          2     salary      <NA>
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(20000L, 30000L, 31000L)) :
  invalid factor level, NA generated
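
The warning names `[<-.factor`, which suggests that the date columns in the real data were stored as factors, so the numeric salaries could not be matched to any existing level when the columns were stacked into `value`. The following is a minimal base R sketch of that mechanism. The vectors `dates`, `salary`, `stacked` and `stacked_chr` are made up for illustration and are not taken from the original session:

```r
# Hypothetical reconstruction: a date column that was read in as a factor
dates  <- factor(c("28Jul2010", "28Jul2011", "28Jul2012"))
salary <- c(20000, 30000, 31000)

# Writing a value that is not an existing level into a factor gives NA
# and the warning "invalid factor level, NA generated"
stacked <- dates
stacked[4] <- salary[1]
is.na(stacked[4])

# Converting the factor to character first keeps the salary (as text)
stacked_chr <- c(as.character(dates), as.character(salary))
as.numeric(stacked_chr[4])
```

If this is indeed the cause, converting the factor columns with `as.character()` before melting should avoid the NAs, although the stacked `value` column will then hold the salaries as text.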

However, if the above had worked, I would then have done this:

melt(df, id.vars = c("employee", "employment")) %>% 
  arrange(employee, employment) %>%
  mutate(variable = paste(variable, employment, sep="_")) %>%
  select(employee, variable, value) %>%
  cast()

  employee end_date_1 end_date_2 salary_1 salary_2 start_date_1 start_date_2
1      Ian  28Jul2011       <NA>     <NA>     <NA>    28Jul2010         <NA>
2     Rose  28Jul2012  28Jul2013     <NA>     <NA>    28Jul2011    28Jul2012

This is almost what I want, apart from the NAs and the ordering of the columns.

Data

df <- 
  structure(list(employee = c("Ian", "Rose", "Rose"), 
               employment = c(1L, 1L, 2L), 
               start_date = c("28Jul2010", "28Jul2011", "28Jul2012"), 
               end_date = c("28Jul2011", "28Jul2012", "28Jul2013"), 
               salary = c(20000.00, 30000.00, 31000.00)), 
          .Names = c("employee", "employment", "start_date", "end_date", "salary"), 
          sorted = c("employee", "employment"), class = c("data.frame"), row.names = c(NA, -3L))


df2 <- 
  structure(list(employee = c("Ian", "Rose"), start_date_employement_1 = c("28Jul2010", "28Jul2011"), 
                 end_date_employment_1 = c("28Jul2011", "28Jul2012"), 
                 salary_employement_1 = c(20000L, 30000L), 
                 start_date_employement_2 = c(NA, "28Jul2012"), 
                 end_date_employment_2 = c(NA, "28Jul2013"), 
                 salary_employement_2 = c(NA, 31000L)), 
            .Names = c("employee", "start_date_employement_1", "end_date_employment_1", "salary_employement_1", "start_date_employement_2", "end_date_employment_2", "salary_employement_2"), 
            class = "data.frame", row.names = c(NA, -2L))

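1 Answer:

Answer 0 (score: 3)

You want dcast to reshape the data frame from long to wide, but reshape2::dcast does not seem to support multiple value.var columns. You can use reshape from base R: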

reshape(df, direction = "wide", idvar = "employee", timevar = "employment")

#  employee start_date.1 end_date.1 salary.1 start_date.2 end_date.2 salary.2
#1      Ian    28Jul2010  28Jul2011    20000         <NA>       <NA>       NA
#2     Rose    28Jul2011  28Jul2012    30000    28Jul2012  28Jul2013    31000
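
If the column names should look more like the ones in the question, the `sep` argument of `reshape` can supply the text placed between the variable name and the employment number. This is a sketch that rebuilds the example data with `data.frame()`. The object name `wide` and the separator `"_employment_"` are choices made here, not part of the original answer:

```r
# Example data rebuilt from the question (strings kept as character)
df <- data.frame(
  employee   = c("Ian", "Rose", "Rose"),
  employment = c(1L, 1L, 2L),
  start_date = c("28Jul2010", "28Jul2011", "28Jul2012"),
  end_date   = c("28Jul2011", "28Jul2012", "28Jul2013"),
  salary     = c(20000, 30000, 31000),
  stringsAsFactors = FALSE
)

# sep is pasted between each variable name and the timevar value,
# giving names such as start_date_employment_1
wide <- reshape(df, direction = "wide", idvar = "employee",
                timevar = "employment", sep = "_employment_")
names(wide)
```

The columns stay grouped by employment number, and `salary` remains numeric because no stacking into a single column takes place.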

Or use data.table::dcast:

library(data.table)
dcast(setDT(df), employee ~ employment, value.var = c("start_date", "end_date", "salary"))
#   employee start_date_1 start_date_2 end_date_1 end_date_2 salary_1 salary_2
#1:      Ian    28Jul2010           NA  28Jul2011         NA    20000       NA
#2:     Rose    28Jul2011    28Jul2012  28Jul2012  28Jul2013    30000    31000
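
The data.table result groups columns by variable, whereas the question asks for them grouped by employment. One possible way to reorder them afterwards is `setcolorder`. In this sketch the names `dt`, `wide_dt`, `vars` and `new_order` are introduced for illustration only:

```r
library(data.table)

# Example data rebuilt from the question
dt <- data.table(
  employee   = c("Ian", "Rose", "Rose"),
  employment = c(1L, 1L, 2L),
  start_date = c("28Jul2010", "28Jul2011", "28Jul2012"),
  end_date   = c("28Jul2011", "28Jul2012", "28Jul2013"),
  salary     = c(20000, 30000, 31000)
)

vars <- c("start_date", "end_date", "salary")
wide_dt <- dcast(dt, employee ~ employment, value.var = vars)

# Build the names employment by employment:
# start_date_1, end_date_1, salary_1, start_date_2, ...
new_order <- c("employee",
               as.vector(outer(vars, sort(unique(dt$employment)),
                               paste, sep = "_")))
setcolorder(wide_dt, new_order)
names(wide_dt)
```

`setcolorder` reorders the columns by reference, so `wide_dt` itself is modified and no copy is made.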