How to expand data before reshaping in R

Time: 2015-02-12 15:02:05

Tags: r reshape expand

I have a data frame that looks like this:

as.is <- data.frame(Project = c('Proj A', 'Proj B', 'Proj C', 'Proj D'), 
               Start.Date = c('16.02.2015', '02.03.2015', '16.02.2015', '09.03.2015'), 
               Duration = c(3, 2, 2, 4),
               No.Of.Resources = c(3, 5, 2, 6))

I need to change the format so that it looks like this:

to.be <- data.frame(Project = c('Proj A', 'Proj B', 'Proj C', 'Proj D'),
                '16.02.2015' = c(3, NA, 2, NA),
                '23.02.2015' = c(3, NA, 2, NA),
                '02.03.2015' = c(3, 5, NA, NA),
                '09.03.2015' = c(NA, 5, NA, 6),
                '16.03.2015' = c(NA, NA, NA, 6),
                '23.03.2015' = c(NA, NA, NA, 6),
                '30.03.2015' = c(NA, NA, NA, 6),
                check.names = FALSE)  # keep the date strings as column names

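I can't figure out how to expand the dates so that I get one row per week, which would let me use reshape2 on the data. I can build the list of dates that I want as column headers, but I can't see how to put the pieces together.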

What is the right way to approach this problem?

Edit: To clarify, Duration is a number of weeks, so I need to generate columns headed x, x + 7, x + 14, ...
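The weekly headers described in the edit can be generated with base R's seq() on a Date. This is only a minimal sketch for a single project, assuming the dd.mm.yyyy format used above; the names start and weeks are illustrative and not part of the original data.

```r
# Parse one start date in the dd.mm.yyyy format used in the question
start <- as.Date("16.02.2015", format = "%d.%m.%Y")

# One date per week of a 3-week project: x, x + 7, x + 14
weeks <- seq(start, by = "week", length.out = 3)

# Format back to dd.mm.yyyy strings for use as column headers
format(weeks, "%d.%m.%Y")
# [1] "16.02.2015" "23.02.2015" "02.03.2015"
```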

3 answers:

Answer 0: (score: 5)

Here's an approach that seems to work. It uses expandRows and getanID from my "splitstackshape" package, and then dcast.data.table from "data.table" to spread the values into wide form:

as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")

library(splitstackshape)
dcast.data.table(
  getanID(
    expandRows(as.is, "Duration"), 
    c("Project", "Start.Date"))[
      , Start.Date := Start.Date + (.id-1) * 7], 
  Project ~ Start.Date, value.var = "No.Of.Resources")
#    Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23 2015-03-30
# 1:  Proj A          3          3          3         NA         NA         NA         NA
# 2:  Proj B         NA         NA          5          5         NA         NA         NA
# 3:  Proj C          2          2         NA         NA         NA         NA         NA
# 4:  Proj D         NA         NA         NA          6          6          6          6
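For comparison, the same steps can be approximated in base R without any packages. This is only a sketch, not what splitstackshape does internally: rep() stands in for expandRows, sequence() for getanID, and tapply() for the cast. The names long, week and wide are made up for this illustration.

```r
as.is <- data.frame(
  Project = c("Proj A", "Proj B", "Proj C", "Proj D"),
  Start.Date = as.Date(c("16.02.2015", "02.03.2015", "16.02.2015", "09.03.2015"),
                       format = "%d.%m.%Y"),
  Duration = c(3, 2, 2, 4),
  No.Of.Resources = c(3, 5, 2, 6)
)

# Repeat each row Duration times (the expansion step)
long <- as.is[rep(seq_len(nrow(as.is)), as.is$Duration), ]

# Number the copies within each project: 1, 2, ..., Duration (the id step)
long$week <- sequence(as.is$Duration)

# Shift each copy by whole weeks from the project's start date
long$Date <- long$Start.Date + (long$week - 1) * 7

# Projects in rows, weekly dates in columns; empty cells become NA
wide <- tapply(long$No.Of.Resources,
               list(long$Project, format(long$Date)),
               sum)
wide
```

The result here is a numeric matrix rather than a data.table, so it suits a quick check more than further processing.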

This is a case where "dplyr" does make for a more readable solution:

library(splitstackshape)
library(dplyr)
library(tidyr)

as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")
expandRows(as.is, "Duration") %>%                   # expand the data
  getanID(c("Project", "Start.Date")) %>%           # add an "id" column
  mutate(Start.Date = Start.Date + (.id-1) * 7) %>% # recalculate start dates
  select(-.id) %>%                                  # drop the "id" column
  spread(Start.Date, No.Of.Resources)               # reshape long to wide

Answer 1: (score: 5)

I'd do it differently, in data.table. Updated with a new solution:

library(data.table)
dt = as.data.table(as.is)
dt[, Start.Date := as.Date(Start.Date, '%d.%m.%Y')]

# use dcast.data.table before version 1.9.5
dcast(dt[, list(seq(Start.Date, length.out = Duration, by = '1 week'), No.Of.Resources)
         , by = Project], Project ~ V1)

The old (and unnecessarily complicated) solution:

# expand out Start.Date by Project
dates.all = dt[, seq(Start.Date, length.out = Duration, by = '1 week'), by = Project]

# set the key and do a rolling join, then dcast
# (can use just dcast in version 1.9.5+, have to use dcast.data.table before that)
setkey(dt, Project, Start.Date)
dcast(dt[dates.all, roll = TRUE], Project ~ Start.Date)
#   Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23 2015-03-30
#1:  Proj A          3          3          3         NA         NA         NA         NA
#2:  Proj B         NA         NA          5          5         NA         NA         NA
#3:  Proj C          2          2         NA         NA         NA         NA         NA
#4:  Proj D         NA         NA         NA          6          6          6          6

Answer 2: (score: 1)

I'd suggest using the tidyr package instead of reshape2. Although tidyr imports reshape2 for some operations, I think it should be considered its successor.

# Convert to Date class to sort the columns correctly
as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")

intermediate <- with(as.is, data.frame(
    Project = rep(Project, Duration),
    Date = rep(Start.Date, Duration) +
           7*(unlist(lapply(Duration, seq_len))-1),
    No.Of.Resources = rep(No.Of.Resources, Duration)
))

require(tidyr)
result <- spread(intermediate, Date, No.Of.Resources)

Looking at result, you get

  Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23
1  Proj A          3          3          3         NA         NA         NA
2  Proj B         NA         NA          5          5         NA         NA
3  Proj C          2          2         NA         NA         NA         NA
4  Proj D         NA         NA         NA          6          6          6
  2015-03-30
1         NA
2         NA
3         NA
4          6

Calling dput(result) on it gives what you asked for

structure(list(
    Project = structure(1:4, .Label = c("Proj A", "Proj B", "Proj C", "Proj D"), class = "factor"),
    `2015-02-16` = c(3, NA, 2, NA),
    `2015-02-23` = c(3, NA, 2, NA),
    `2015-03-02` = c(3, 5, NA, NA),
    `2015-03-09` = c(NA, 5, NA, 6),
    `2015-03-16` = c(NA, NA, NA, 6),
    `2015-03-23` = c(NA, NA, NA, 6),
    `2015-03-30` = c(NA, NA, NA, 6)),
    .Names = c("Project", "2015-02-16", "2015-02-23", "2015-03-02", "2015-03-09", "2015-03-16", "2015-03-23", "2015-03-30"),
    class = "data.frame", row.names = c(NA, 4L))
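The column names above are ISO dates, while the question asked for dd.mm.yyyy headers. One possible way to get them is to rename the columns after the reshape, which keeps the chronological column order. The sketch below uses a small stand-in for result (two projects, two dates) rather than the full output, and date.cols is a name invented for this example.

```r
# Stand-in for part of `result`, with ISO dates as column names
result <- data.frame(Project = c("Proj A", "Proj B"),
                     `2015-02-16` = c(3, NA),
                     `2015-03-02` = c(3, 5),
                     check.names = FALSE)

# Reformat every column name except Project from yyyy-mm-dd to dd.mm.yyyy
date.cols <- names(result) != "Project"
names(result)[date.cols] <- format(as.Date(names(result)[date.cols]), "%d.%m.%Y")

names(result)
# [1] "Project"    "16.02.2015" "02.03.2015"
```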