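Reshape to wide and keep duplicate rows

Time: 2018-06-04 08:30:48

Tags: r reshape

For a given dataset, I want to convert my data from long format to wide format. I used the reshape function to do this.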

id  status      timestamp   
1   assigned   2017-01-02  
1   done       2017-01-03  
1   locked     2017-01-04   
2   assigned   2017-01-02   
2   done       2017-01-03  
2   assigned   2017-01-03  
2   done       2017-01-04 
2   locked     2017-01-05  
3   assigned   2017-01-02  
3   done       2017-01-03 
3   locked     2017-01-04 
...

# reshape function to convert long format to Wide.
temp <- reshape(temp, idvar = "id", timevar = "status", direction = "wide")

Result:

id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

When I do this, some rows are dropped. For example, id 2 has several rows with status = assigned, and reshape keeps only the first one.
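The dropped rows can be located before reshaping. The following is a minimal sketch, assuming the sample above is stored in a data frame named temp with timestamps kept as character strings (the construction code is an assumption; the question only shows the printed table). It flags every row whose (id, status) pair has already occurred, which are the rows a wide reshape keyed on id alone does not keep:

```r
# Sample data rebuilt from the table in the question (construction is assumed).
temp <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  status = c("assigned", "done", "locked",
             "assigned", "done", "assigned", "done", "locked",
             "assigned", "done", "locked"),
  timestamp = c("2017-01-02", "2017-01-03", "2017-01-04",
                "2017-01-02", "2017-01-03", "2017-01-03", "2017-01-04", "2017-01-05",
                "2017-01-02", "2017-01-03", "2017-01-04"),
  stringsAsFactors = FALSE
)

# TRUE for rows that repeat an earlier (id, status) combination.
dup <- duplicated(temp[, c("id", "status")])

# Rows that would be lost when reshaping with idvar = "id" only.
temp[dup, ]
```

For the sample data this flags the second assigned row and the second done row of id 2.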

How can I convert to wide format without dropping rows? Basically, I don't want to lose any data.

Expected result (either of the following):
id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 2017-01-05
2 2017-01-03 2017-01-04 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 NA
2 2017-01-03 2017-01-04 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

2 answers:

Answer 0 (score: 0)

One thing you can do is add a variable that gives each new job a unique value. You can then use it to reshape your data.

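# Running job counter, kept in the global environment.
i <- 0

# Walk through the statuses in row order and give every row the current job number.
temp$key <- sapply(temp$status, function(x) {
  if(x == "assigned") {i <<- i+1; i}  # an "assigned" row starts a new job
  else {i}                            # other rows belong to the current job
})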

temp

   id   status  timestamp key
1   1 assigned 2017-01-02   1
2   1     done 2017-01-03   1
3   1   locked 2017-01-04   1
4   2 assigned 2017-01-02   2
5   2     done 2017-01-03   2
6   2 assigned 2017-01-03   3
7   2     done 2017-01-04   3
8   2   locked 2017-01-05   3
9   3 assigned 2017-01-02   4
10  3     done 2017-01-03   4
11  3   locked 2017-01-04   4

temp2 <- reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")

temp2

  id key timestamp.assigned timestamp.done timestamp.locked
1  1   1         2017-01-02     2017-01-03       2017-01-04
4  2   2         2017-01-02     2017-01-03             <NA>
6  2   3         2017-01-03     2017-01-04       2017-01-05
9  3   4         2017-01-02     2017-01-03       2017-01-04

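Answer 1 (score: 0)

1. cumsum()

Esther's approach of assigning a number to each new job is the way to go.

However, R already has the cumsum() function, which can be used for this purpose: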

temp$key <- cumsum(temp$status == "assigned")
reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")
   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   2         2017-01-02     2017-01-03             <NA>
3:  2   3         2017-01-03     2017-01-04       2017-01-05
4:  3   4         2017-01-02     2017-01-03       2017-01-04
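Why cumsum() works here can be seen on a bare vector. This is a small sketch using only the statuses of ids 1 and 2 from the question; the variable names are illustrative. It assumes the rows are ordered so that every job starts with an assigned row:

```r
status <- c("assigned", "done", "locked",
            "assigned", "done",
            "assigned", "done", "locked")

# The comparison is TRUE (counted as 1) on each "assigned" row and FALSE (0)
# elsewhere, so the running sum increases by one exactly where a new job starts.
key <- cumsum(status == "assigned")

# Group the statuses by job number to inspect the resulting blocks.
split(status, key)
```

If the rows were ordered differently, for example sorted by status, the keys would no longer line up with the jobs.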

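2. Grouped cumsum()

Although this solves the OP's original problem, key simply numbers all assignments across all ids. If the OP wants the assignments numbered separately for each id, cumsum() has to be applied grouped by id.

One way to do this is with data.table syntax: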

library(data.table)
setDT(temp)[, key := cumsum(status == "assigned"), by = id]
dcast(temp, id + key ~ status, value.var = "timestamp")
   id key   assigned       done     locked
1:  1   1 2017-01-02 2017-01-03 2017-01-04
2:  2   1 2017-01-02 2017-01-03       <NA>
3:  2   2 2017-01-03 2017-01-04 2017-01-05
4:  3   1 2017-01-02 2017-01-03 2017-01-04

dcast() is a replacement for base R's reshape(..., direction = "wide") function and is available from the reshape2 and data.table packages.
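The same per-id numbering can also be sketched in base R without data.table, using ave() to run cumsum() within each id. The data frame construction below is an assumption rebuilt from the table in the question, with timestamps kept as character strings:

```r
# Sample data rebuilt from the table in the question (construction is assumed).
temp <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  status = c("assigned", "done", "locked",
             "assigned", "done", "assigned", "done", "locked",
             "assigned", "done", "locked"),
  timestamp = c("2017-01-02", "2017-01-03", "2017-01-04",
                "2017-01-02", "2017-01-03", "2017-01-03", "2017-01-04", "2017-01-05",
                "2017-01-02", "2017-01-03", "2017-01-04"),
  stringsAsFactors = FALSE
)

# Number the jobs within each id: ave() applies cumsum() separately per id group.
temp$key <- ave(as.integer(temp$status == "assigned"), temp$id, FUN = cumsum)

# Reshape to wide format, treating each (id, key) pair as one output row.
wide <- reshape(temp, idvar = c("id", "key"), timevar = "status", direction = "wide")
wide
```

Because every (id, key) pair now has at most one row per status, no rows are discarded; a job without a locked row gets NA in timestamp.locked.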

3. Grouped cumsum() on the fly

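The formula interface of data.table's dcast() also accepts expressions. This makes it unnecessary to modify temp by appending a key column before reshaping. Instead, the grouped cumsum() can be computed on the fly while reshaping: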

dcast(temp, id + ave(key <- status == "assigned", id, FUN = cumsum) ~ 
        paste0("timestamp.", status))


   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   1         2017-01-02     2017-01-03             <NA>
3:  2   2         2017-01-03     2017-01-04       2017-01-05
4:  3   1         2017-01-02     2017-01-03       2017-01-04