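Reshaping a data frame for prediction

Time: 2016-03-17 00:31:49

Tags: r reshape reshape2

I just picked up the reshape package today, and I'm having a hard time understanding how it works.

I have the following data frame: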

name  workoutnum  time  weight   raceid     final_position
tommy      1       12     140       1             2
tommy      2       14     140       1             2 
tommy      3       11     140       1             2
sarah      1       10     115       1             1
sarah      2       10     115       1             1
sarah      3       11     115       1             1
sarah      4       15     115       1             1

How can I put all of this on a single row per name, so that the data frame looks like this:

    name  workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
   tommy     1            1           1           0        12     14   11    NA     140     1           2  
   sarah     1            1           1           1        10     10   11    15     115     1           1

So every column would get the workout number appended to its name.

Is this even the right approach?

2 answers:

Answer 0: (score: 1)

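Reshaping seems like a natural part of what you want to do, but it won't get you all the way there.

Here is a reshape2 approach that fully melts the data and then casts it back to a wide data.frame, with a few adjustments along the way to get the desired output.

Note that in the call to melt(), the variables named in the id.vars argument stay wide (they remain ordinary columns). Then, in dcast(), the variables to be cast wide go on the RHS of the ~.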

library(reshape2)
library(dplyr)

# fully melt the data (d is the data frame listed under "Data" below)
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
  group_by(name, variable) %>%
  mutate(i = row_number(),
         wide_variable = paste0(variable, i))

# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
# replace the workoutnum indices with indicators for missingness
# (across() supersedes the deprecated mutate_each()/funs(); needs dplyr >= 1.0.0)
d_wide %>% mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))
#    name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah      1        1    115    10    10    11    15           1           1
# 2 tommy      1        2    140    12    14    11    NA           1           1
#   workoutnum3 workoutnum4
# 1           1           1
# 2           1           0
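
To make the roles of id.vars and the RHS of ~ concrete, here is a minimal sketch on a toy data frame. The names toy, visit and score are hypothetical and only serve the illustration; they are not part of the question's data.

```r
library(reshape2)

# Hypothetical toy data: up to two visits per subject
toy <- data.frame(
  id    = c("a", "a", "b"),
  visit = c(1L, 2L, 1L),
  score = c(5, 7, 9)
)

# Columns named in id.vars stay as columns; every other column
# (here only `score`) is stacked into the variable/value pair
toy_long <- melt(toy, id.vars = c("id", "visit"))

# `id` on the LHS of ~ identifies the output rows;
# `visit` on the RHS is spread into one column per distinct value
toy_wide <- dcast(toy_long, id ~ visit, value.var = "value")
toy_wide
```

Subject "b" has no second visit, so its cell in the second visit column is filled with NA, which is the same mechanism that produces the NA in time4 for tommy above.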

Data:

d <- structure(list(name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"), workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L), weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L), raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"), class = "data.frame", row.names = c(NA, -7L))

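Answer 1: (score: 1)

Here is an approach using dcast from "data.table", which behaves more like the reshape function in base R.

The only change I made to the data was to add another "time" variable. As @rawr pointed out in the comments, your "workoutnum" column is almost such a time variable already.

I used getanID from the "splitstackshape" package to generate the "time" variable, but you could create this variable in many different ways.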

library(data.table)  # provides the dcast() that accepts several value.var columns
library(splitstackshape)
dcast(getanID(mydf, c("name", "raceid", "final_position")), 
      name + raceid + final_position ~ .id, 
      value.var = c("workoutnum", "time", "weight"))

##     name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah      1              1            1            2            3
## 2: tommy      1              2            1            2            3
##    workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1:            4     10     10     11     15      115      115      115      115
## 2:           NA     12     14     11     NA      140      140      140       NA

If you use getanID, you can also use reshape like this:

reshape(getanID(mydf, c("name", "raceid", "final_position")), 
        idvar = c("name", "raceid", "final_position"), timevar = ".id", 
        direction = "wide")
##     name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy      1              2            1     12      140            2     14
## 2: sarah      1              1            1     10      115            2     10
##    weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1:      140            3     11      140           NA     NA       NA
## 2:      115            3     11      115            4     15      115

dcast will generally be more efficient.
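
The answer notes that the ".id" variable can be created in many different ways. As one possible sketch that needs no extra packages, the index can be built with ave() and then passed to base reshape(). The answer never shows mydf, so the definition below is an assumed reconstruction of the question's table, with the last column named final_position.

```r
# Assumed reconstruction of the question's data (not given in the answer)
mydf <- data.frame(
  name           = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum     = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time           = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight         = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid         = 1L,
  final_position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)
)

# Number the rows within each name/raceid/final_position group,
# playing the role of the ".id" column that getanID() adds
mydf$.id <- ave(seq_len(nrow(mydf)),
                mydf$name, mydf$raceid, mydf$final_position,
                FUN = seq_along)

# Spread every remaining column by that index using base R only
wide <- reshape(mydf,
                idvar     = c("name", "raceid", "final_position"),
                timevar   = ".id",
                direction = "wide")
wide
```

This gives one row per name, with columns such as time.1 to time.4, and NA where tommy has no fourth workout. For data this small the choice between reshape and dcast matters little.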