Combining multiple columns into tidy data

Time: 2015-02-25 21:03:03

Tags: r dplyr tidyr

My dataset looks like this:

unique.id abx.1    start.1     stop.1 abx.2    start.2     stop.2 abx.3    start.3     stop.3 abx.4    start.4
1         1  Moxi 2014-01-01 2014-01-07  PenG 2014-01-01 2014-01-07 Vanco 2014-01-01 2014-01-07  Moxi 2014-01-01
2         2  Moxi 2014-01-01 2014-01-02 Cipro 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02 Vanco 2014-01-01
3         3 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01 2014-01-05 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01
4         4 Vanco 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03  PenG 2014-01-02
5         5 Vanco 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02  PenG 2014-01-01 2014-01-02 Cipro 2014-01-01
      stop.4    intervention
1 2014-01-07       0
2 2014-01-02       0
3 2014-01-05       1
4 2014-01-03       1
5 2014-01-02       0

Here is some code to create it:

    abxoptions <- c("Cipro", "Moxi", "PenG", "Vanco")
    # abx.* are drawn at random, so the values differ from run to run
    df3 <- data.frame(
      unique.id = 1:5,
      abx.1 = sample(abxoptions, 5, replace=TRUE),
      start.1 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.1  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.2 = sample(abxoptions, 5, replace=TRUE),
      start.2 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.2  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.3 = sample(abxoptions, 5, replace=TRUE),
      start.3 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.3  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      abx.4 = sample(abxoptions, 5, replace=TRUE),
      start.4 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
      stop.4  = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
      intervention = c(0,0,1,1,0)
    )

I would like to tidy this data into something like this:

unique.id    abx     start      stop        intervention
1            Moxi    2014-01-01 2014-01-07  0
1            PenG    2014-01-01 2014-01-07  0
1            Vanco   2014-01-01 2014-01-07  0
1            Moxi    2014-01-01 2014-01-07  0
etc.

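The following solutions did not get me where I need to be: "Gather multiple sets of columns" and "Combining multiple columns into one".

I suspect Hadley's amazing tidyr package is the way to go... I just can't figure it out. Any help would be greatly appreciated.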

3 answers:

Answer 0: (score: 10)

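Almost every data tidying problem can be solved in three steps:

  1. Gather all non-variable columns
  2. Separate the "colname" column into multiple variables
  3. Re-spread the data

(Often you'll only need one or two of these, but I think they're almost always in this order.)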
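
To make the intermediate shapes concrete, here is a minimal sketch of the three steps on a made-up table. The data frame `toy` and its `dose` column are hypothetical and are not part of the question's data; every column is a character vector, so no types are lost along the way.

```r
library(tidyr)

# Hypothetical toy data: two measurements (abx, dose) recorded twice per id
toy <- data.frame(
  id     = 1:2,
  abx.1  = c("Moxi", "Cipro"),
  dose.1 = c("400", "500"),
  abx.2  = c("PenG", "Vanco"),
  dose.2 = c("600", "1000"),
  stringsAsFactors = FALSE
)

# Step 1: gather every non-variable column into key/value pairs
step1 <- gather(toy, col, value, -id)

# Step 2: split the old column names (e.g. "abx.1") into two variables
step2 <- separate(step1, col, c("variable", "number"))

# Step 3: spread the "variable" column back out into columns
step3 <- spread(step2, variable, value)
```

Printing `step1`, `step2` and `step3` in turn shows the table going from wide, to one row per cell, to one row per id and number.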

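For your data:

  1. The only columns that are already variables are unique.id and intervention
  2. You need to split the current column names into a variable and a number
  3. Then you need to put the "variable" variable back into columns

This looks like: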

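      library(tidyr)
      library(dplyr)

      df3 %>%
        # 1. gather every column except the two that are already variables
        gather(col, value, -unique.id, -intervention) %>%
        # 2. split e.g. "abx.1" into variable = "abx" and number = "1"
        separate(col, c("variable", "number")) %>%
        # 3. spread abx/start/stop back out into columns
        spread(variable, value, convert = TRUE) %>%
        # the dates lose their class in the shared value column; rebuild them
        mutate(start = as.Date(start, origin = "1970-01-01"),
               stop = as.Date(stop, origin = "1970-01-01"))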
      

Your case is slightly more complicated because you have two types of variables, so you need to restore the types at the end.
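
As a hedged aside: tidyr 1.0.0 and later (released after this answer was written) provide `pivot_longer()`, which can do the same reshape in one call and keeps each column's type, so no date conversion is needed afterwards. The sketch below assumes such a tidyr version is installed; `wide` is a small deterministic stand-in for `df3`, not the question's actual data.

```r
library(tidyr)

# Small stand-in for df3: two patients, two antibiotic courses each
wide <- data.frame(
  unique.id = 1:2,
  abx.1   = c("Moxi", "Cipro"),
  start.1 = as.Date(c("2014-01-01", "2014-01-02")),
  stop.1  = as.Date(c("2014-01-07", "2014-01-03")),
  abx.2   = c("PenG", "Vanco"),
  start.2 = as.Date(c("2014-01-01", "2014-01-02")),
  stop.2  = as.Date(c("2014-01-07", "2014-01-03")),
  intervention = c(0, 1),
  stringsAsFactors = FALSE
)

# ".value" tells pivot_longer that the part of each name before the "."
# (abx, start, stop) names an output column; the part after it goes
# into the new "number" column.
long <- pivot_longer(
  wide,
  cols = -c(unique.id, intervention),
  names_to = c(".value", "number"),
  names_sep = "\\."
)
```

Here `long` has one row per patient and course, with `start` and `stop` still of class Date.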

Answer 1: (score: 7)

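You could try reshape from base R:

# only the abx/start/stop columns vary; intervention has no "." suffix to split on
reshape(df3, direction='long', idvar='unique.id',
        varying=grep('^(abx|start|stop)\\.', names(df3)), sep='.')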

Or use merged.stack from splitstackshape:

 library(splitstackshape)
 merged.stack(df3, var.stubs=c('abx', 'start', 'stop'), sep='.')[,
    c('start', 'stop') := lapply(.SD, as.Date,
                   origin='1970-01-01'), .SDcols=c('start', 'stop')][]

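Answer 2: (score: 4)

Recently, melt.data.table gained a new feature that makes it painless to melt into multiple columns. All you have to do is pass the columns you want to melt separately as a list to the measure.vars argument.

You can get the development version by following these instructions.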

require(data.table) ## v1.9.5+
dat <- as.data.table(df3) # dat is a data.table copy of the question's df3
melt(dat, id = 1L, measure = patterns("^abx", "^start", "^stop"), 
          value.name = c("abx", "start", "stop"))

#     unique.id variable   abx      start       stop
#  1:         1        1  Moxi 2014-01-01 2014-01-07
#  2:         2        1  Moxi 2014-01-01 2014-01-02
#  3:         3        1 Cipro 2014-01-01 2014-01-05
#  4:         4        1 Vanco 2014-01-02 2014-01-03
#  5:         5        1 Vanco 2014-01-01 2014-01-02
#  6:         1        2  PenG 2014-01-01 2014-01-07
#  7:         2        2 Cipro 2014-01-01 2014-01-02
#  8:         3        2 Vanco 2014-01-01 2014-01-05
#  9:         4        2 Cipro 2014-01-02 2014-01-03
# 10:         5        2  PenG 2014-01-01 2014-01-02
# 11:         1        3 Vanco 2014-01-01 2014-01-07
# 12:         2        3  PenG 2014-01-01 2014-01-02
# 13:         3        3 Cipro 2014-01-01 2014-01-05
# 14:         4        3 Cipro 2014-01-02 2014-01-03
# 15:         5        3  PenG 2014-01-01 2014-01-02
# 16:         1        4  Moxi 2014-01-01 2014-01-07
# 17:         2        4 Vanco 2014-01-01 2014-01-02
# 18:         3        4 Vanco 2014-01-01 2014-01-05
# 19:         4        4  PenG 2014-01-02 2014-01-03
# 20:         5        4 Cipro 2014-01-01 2014-01-02

I've used the column number here, but you could also provide the column name.
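
A minimal sketch of the column-name form, passing `measure.vars` as an explicit list instead of using `patterns()`. It assumes data.table 1.9.6 or later; `wide` is a small stand-in for the question's data, and `intervention` is listed as an id column so that it is carried through to the result.

```r
library(data.table)

# Small stand-in for the question's data: two patients, two courses each
wide <- data.table(
  unique.id = 1:2,
  abx.1   = c("Moxi", "Cipro"),
  start.1 = as.Date(c("2014-01-01", "2014-01-02")),
  stop.1  = as.Date(c("2014-01-07", "2014-01-03")),
  abx.2   = c("PenG", "Vanco"),
  start.2 = as.Date(c("2014-01-01", "2014-01-02")),
  stop.2  = as.Date(c("2014-01-07", "2014-01-03")),
  intervention = c(0, 1)
)

# Each list element is one group of columns melted into a single output column
long <- melt(
  wide,
  id.vars = c("unique.id", "intervention"),
  measure.vars = list(
    c("abx.1", "abx.2"),
    c("start.1", "start.2"),
    c("stop.1", "stop.2")
  ),
  value.name = c("abx", "start", "stop")
)
```

The `variable` column in `long` plays the role of the course number, as in the output shown above.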