Question

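I have a problem that I hope someone can help me with. It is basically a data manipulation task. I have a large data set with 10 columns: "id" and 3 sets of similar variables, "type", "startdate" and "enddate". Below is an example.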

  id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3
1  1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18
2  2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15

    enddate3
1 2007-09-27
2 2008-02-07

I would like to obtain the following cleaned and sorted data set:

  id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3
1  1     W 2006-08-01 2007-08-29     A 2006-08-20 2006-12-06     P 2007-08-18
2  2     A 2006-01-05 2007-07-02     Q 2008-01-15 2008-02-07    NA         NA

    enddate3
1 2007-09-27
2         NA

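Within each row/observation, I want to sort the groups in ascending order of "startdate". So for row 1, because the second group of variables has an earlier "startdate" (2006-08-01) than the first group (2006-08-20), I move it into the first group position.

For row 2, I want to push all the NAs to the end.

Any tips on how to do this efficiently?

Should I convert "startdate" and "enddate" to numeric? If so, how should I handle the "NA" values?

Would it be wise to apply the paste() function to (type, startdate, enddate) for all 3 groups?

Any help is appreciated! Thanks in advance!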

Answer 1

Same approach as Mikko Marttila's answer, but without non-standard libraries:

> ## use vectors of class Date
> df[c(3,4,6,7,9,10)] <- lapply(df[c(3,4,6,7,9,10)], as.Date)

> ## reshape to long format
> df.1 <- reshape(df, idvar='id',
+                 varying=list(c(2,5,8), c(3,6,9), c(4,7,10)),
+                 v.names=c('type', 'startdate', 'enddate'),
+                 times=c(1,2,3), timevar='group', direction='long')
> df.1
#     id group type  startdate    enddate
# 1.1  1     1    A 2006-08-20 2006-12-06
# 2.1  2     1    A 2006-01-05 2007-07-02
# 1.2  1     2    W 2006-08-01 2007-08-29
# 2.2  2     2 <NA>       <NA>       <NA>
# 1.3  1     3    P 2007-08-18 2007-09-27
# 2.3  2     3    Q 2008-01-15 2008-02-07

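> ## reset group variable to the rank of startdate within each id (NAs last);
> ## order(order(x)) turns the sort permutation into ranks
> df.1$group <- with(df.1, unsplit(lapply(split(startdate, id),
+                                         function(x) order(order(x))), id))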
> df.1
#     id group type  startdate    enddate
# 1.1  1     2    A 2006-08-20 2006-12-06
# 2.1  2     1    A 2006-01-05 2007-07-02
# 1.2  1     1    W 2006-08-01 2007-08-29
# 2.2  2     3 <NA>       <NA>       <NA>
# 1.3  1     3    P 2007-08-18 2007-09-27
# 2.3  2     2    Q 2008-01-15 2008-02-07

> ## back to wide format
> df.2 <- reshape(df.1[order(df.1$group), ], idvar='id',
+                 v.names=c('type', 'startdate', 'enddate'), timevar='group',
+                 direction='wide')

> ## sort by id
> df.2 <- df.2[order(df.2$id), ]

> df.2
#     id type.1 startdate.1  enddate.1 type.2 startdate.2  enddate.2 type.3
# 1.2  1      W  2006-08-01 2007-08-29      A  2006-08-20 2006-12-06      P
# 2.1  2      A  2006-01-05 2007-07-02      Q  2008-01-15 2008-02-07   <NA>
#     startdate.3  enddate.3
# 1.2  2007-08-18 2007-09-27
# 2.1        <NA>       <NA>
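
On the question of converting dates to numeric: Date vectors sort correctly as they are, and order() places NA last by default, so no conversion should be needed. When resetting the group variable, the quantity needed is the rank of each startdate within its id, not the sort permutation. The two coincide for the example rows but not in general. A minimal sketch with a made-up vector `x` (not taken from the question's data):

```r
# Hypothetical start dates for a single id, including a missing value
x <- as.Date(c("2006-06-15", NA, "2006-01-01"))

perm  <- order(x)         # positions of the elements in sorted order, NA last
ranks <- order(order(x))  # position each element takes after sorting

perm   # 3 1 2
ranks  # 2 3 1
```

Here the first date should land in group 2, the NA in group 3 and the last date in group 1, which is what `ranks` gives.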

Answer 2

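We can use rbind.fill from the plyr package. That function is smart enough to combine data frames by column name, which is not what we want here. To shift the observations in each row forward, we drop the NAs and then apply the names of the original data frame to the shortened row.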

library(plyr)

df <- data.frame("obs" = seq(3),
                 type1 = c(2,2,NA),date1 = c("date11","date21",NA), 
                 type2 = c(3,NA,5),date2 = c("date12",NA,"date31"),
                 type3 = c(4,3,1), date3 = c("date13","date22","date32"),
                 type4 = c(4,4,NA),date4 = c("date14","date23",NA))
df
#    obs type1  date1 type2  date2 type3  date3 type4  date4
#    1   1     2 date11     3 date12     4 date13     4 date14
#    2   2     2 date21    NA   <NA>     3 date22     4 date23
#    3   3    NA   <NA>     5 date31     1 date32    NA   <NA>

newdf <- lapply(seq_len(nrow(df)), function(i){   ## lapply always returns a list
    newrow <- df[i, !is.na(df[i, ]), drop = FALSE]  ## Remove NA's
    names(newrow) <- names(df)[1:length(newrow)]  ## Apply names

    newrow                                        ## Output
})

rbind.fill(newdf)
#    obs type1  date1 type2  date2 type3  date3 type4  date4
#    1   1     2 date11     3 date12     4 date13     4 date14
#    2   2     2 date21     3 date22     4 date23    NA   <NA>
#    3   3     5 date31     1 date32    NA   <NA>    NA   <NA>

Caveat: this code only works when the type and date in each pair are either both observed or both NA.
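
The precondition in that caveat can be checked before dropping the NAs. The following is only a sketch: `pairs_consistent` is a hypothetical helper, and it assumes the columns are named typeN/dateN as in the example above.

```r
# Small made-up data in the same layout as the example
df_ok <- data.frame(obs = 1:2,
                    type1 = c(2, NA), date1 = c("date11", NA),
                    type2 = c(3, 5),  date2 = c("date12", "date21"))

# Hypothetical helper: TRUE when every typeN/dateN pair is either fully
# observed or fully NA, which the NA-dropping approach relies on
pairs_consistent <- function(d) {
    type_cols <- grep("^type", names(d), value = TRUE)
    date_cols <- sub("^type", "date", type_cols)
    all(is.na(d[type_cols]) == is.na(d[date_cols]))
}

pairs_consistent(df_ok)   # TRUE

df_bad <- df_ok
df_bad$date2[1] <- NA     # type2 observed but date2 missing in row 1
pairs_consistent(df_bad)  # FALSE
```

If the check returns FALSE, shifting values left would pair a type with the wrong date, so the long-format approaches in the other answers are probably the safer choice.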

Answer 3

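Here is a solution using dplyr and tidyr. It relies on converting the data set to long format, reordering as needed, and then converting back to wide format. Converting to long format coerces the values to character, so the column types need to be reapplied afterwards.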

library(tidyr)
library(dplyr)

df <- read.table(header = TRUE, text = "
id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
 1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18 2007-09-27
 2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15 2008-02-07
")

df %>%
    gather(key, value, -id) %>%  # convert to long format
    extract(key, c("var", "seq"), "(.*)(\\d)") %>%  # extract sequence number
    spread(var, value) %>%  # spread to wide format by id and sequence
    group_by(id) %>%
    arrange(startdate) %>%  # sort seq by startdate in id groups
    mutate(seq = 1:n()) %>%  # calculate new sequence order
    gather(key, value, -id, -seq) %>%  # convert to long format
    transmute(var = paste0(key, seq), value) %>%  # generate wide format names
    spread(var, value) %>%  # spread back to wide format
    select(one_of(names(df))) %>%  # restore original column order
    mutate(across(all_of(grep("date", names(df), value = TRUE)), as.Date))
        # reapply date type to original date variables (across() needs dplyr >= 1.0)

#     Source: local data frame [2 x 10]
#     Groups: id [2]
#     
#          id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
#       (int) (chr)     (date)     (date) (chr)     (date)     (date) (chr)     (date)     (date)
#     1     1     W 2006-08-01 2007-08-29     A 2006-08-20 2006-12-06     P 2007-08-18 2007-09-27
#     2     2     A 2006-01-05 2007-07-02     Q 2008-01-15 2008-02-07    NA       <NA>       <NA>
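
The same long/sort/wide idea can probably be written more compactly with the newer pivot functions. This is a sketch rather than part of the original answer; it assumes tidyr >= 1.0 and dplyr >= 1.0, and the name `sorted` is made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id type1 startdate1   enddate1 type2 startdate2   enddate2 type3 startdate3   enddate3
 1     A 2006-08-20 2006-12-06     W 2006-08-01 2007-08-29     P 2007-08-18 2007-09-27
 2     A 2006-01-05 2007-07-02    NA         NA         NA     Q 2008-01-15 2008-02-07
")

sorted <- df %>%
    # one row per (id, group); ".value" keeps type/startdate/enddate as columns
    pivot_longer(-id, names_to = c(".value", "seq"),
                 names_pattern = "(.*)(\\d)") %>%
    arrange(id, startdate) %>%        # NA start dates sort last
    group_by(id) %>%
    mutate(seq = row_number()) %>%    # new group position within each id
    ungroup() %>%
    pivot_wider(names_from = seq,
                values_from = c(type, startdate, enddate),
                names_sep = "") %>%   # rebuild names such as "type1"
    select(all_of(names(df))) %>%     # restore original column order
    mutate(across(contains("date"), as.Date))

sorted
```

Because the three variables stay in separate columns in the long format, the values are not coerced to a single character column along the way. The dates are only converted at the end because read.table reads them as character.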

R: How to sort groups row-wise according to an attribute value (date)?
