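I have a problem that I hope someone can help with. It is basically a data-manipulation task. I have a large dataset with 10 columns: "id" plus 3 sets of similar variables, "type", "startdate", and "enddate". Here is an example.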
id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3
1 1 A 2006-08-20 2006-12-06 W 2006-08-01 2007-08-29 P 2007-08-18
2 2 A 2006-01-05 2007-07-02 NA NA NA Q 2008-01-15
enddate3
1 2007-09-27
2 2008-02-07
I would like to obtain the following cleaned and sorted dataset:
id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3
1 1 W 2006-08-01 2007-08-29 A 2006-08-20 2006-12-06 P 2007-08-18
2 2 A 2006-01-05 2007-07-02 Q 2008-01-15 2008-02-07 NA NA
enddate3
1 2007-09-27
2 NA
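I want to sort the sets by "startdate" in ascending order within each row/observation. In row 1, for example, the second set has an earlier "startdate" (2006-08-01) than the first set (2006-08-20), so it moves into the first position.
For row 2, I want to push all the NAs to the end.
Any tips on how to do this efficiently?
Should I convert "startdate" and "enddate" to numeric? If so, how should I handle "NA"?
Would it be sensible to apply paste() to (type, startdate, enddate) for all 3 sets?
Any help is appreciated. Thanks in advance!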
Answer 0 (score: 2)
The same approach as Mikko Marttila's answer, but using base R only:
> ## use vectors of class Date
> df[c(3,4,6,7,9,10)] <- lapply(df[c(3,4,6,7,9,10)], as.Date)
> ## reshape to long format
> df.1 <- reshape(df, idvar=1,
+ varying=list(c(2,5,8), c(3,6,9), c(4,7,10)),
+ v.names=c('type', 'startdate', 'enddate'),
+ times=c(1,2,3), timevar='group', direction='long')
> df.1
# id group type startdate enddate
# 1.1 1 1 A 2006-08-20 2006-12-06
# 2.1 2 1 A 2006-01-05 2007-07-02
# 1.2 1 2 W 2006-08-01 2007-08-29
# 2.2 2 2 <NA> <NA> <NA>
# 1.3 1 3 P 2007-08-18 2007-09-27
# 2.3 2 3 Q 2008-01-15 2008-02-07
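> ## reset group variable to the rank of startdate within each id
> ## (order(order(x)) gives ranks; a single order() matches them only by coincidence for this data)
> df.1$group <- with(df.1, unsplit(lapply(split(startdate, id), function(x) order(order(x))), id))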
> df.1
# id group type startdate enddate
# 1.1 1 2 A 2006-08-20 2006-12-06
# 2.1 2 1 A 2006-01-05 2007-07-02
# 1.2 1 1 W 2006-08-01 2007-08-29
# 2.2 2 3 <NA> <NA> <NA>
# 1.3 1 3 P 2007-08-18 2007-09-27
# 2.3 2 2 Q 2008-01-15 2008-02-07
> ## back to wide format
> df.2 <- reshape(df.1[order(df.1$group), ], idvar=1,
+ v.names=c('type', 'startdate', 'enddate'), timevar='group',
+ direction='wide')
> ## sort by id
> df.2 <- df.2[order(df.2$id), ]
> df.2
# id type.1 startdate.1 enddate.1 type.2 startdate.2 enddate.2 type.3
# 1.2 1 W 2006-08-01 2007-08-29 A 2006-08-20 2006-12-06 P
# 2.1 2 A 2006-01-05 2007-07-02 Q 2008-01-15 2008-02-07 <NA>
# startdate.3 enddate.3
# 1.2 2007-08-18 2007-09-27
# 2.1 <NA> <NA>
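The question also asked whether the dates need to be converted to numbers and how NA should be handled. As a minimal sketch (not part of the original answer), the reordering can be done row by row without reshaping: the start dates are compared as day counts, and order() places NA last by default. The helper name sort_groups and its n_groups argument are hypothetical, and the sketch assumes the columns are named type1..3, startdate1..3, and enddate1..3 as in the question.

```r
# Example data from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3 enddate3
1 A 2006-08-20 2006-12-06 W 2006-08-01 2007-08-29 P 2007-08-18 2007-09-27
2 A 2006-01-05 2007-07-02 NA NA NA Q 2008-01-15 2008-02-07
")

# Store all date columns as class Date
date_cols <- grep("date", names(df))
df[date_cols] <- lapply(df[date_cols], as.Date)

# sort_groups() is a hypothetical helper, not a function from any package
sort_groups <- function(d, n_groups = 3) {
  out <- d
  for (i in seq_len(nrow(d))) {
    # Start dates of row i as day counts; a missing date stays NA
    starts <- sapply(seq_len(n_groups), function(g) {
      as.numeric(d[[paste0("startdate", g)]][i])
    })
    ord <- order(starts)  # ascending, NA last by default
    # Copy the k-th earliest set into position k
    for (k in seq_len(n_groups)) {
      for (v in c("type", "startdate", "enddate")) {
        out[[paste0(v, k)]][i] <- d[[paste0(v, ord[k])]][i]
      }
    }
  }
  out
}

res <- sort_groups(df)
res
```

The explicit loop is easy to follow but will be slow on a large dataset; the reshape-based approach above scales better.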
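Answer 1 (score: 1)
We can use rbind.fill from the plyr package. The function is smart enough to combine data frames by column name, which we do not want here. To shift each row's observations forward, we remove the NAs and then apply the names of the original data frame to the shortened row.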
library(plyr)
df <- data.frame("obs" = seq(3),
type1 = c(2,2,NA),date1 = c("date11","date21",NA),
type2 = c(3,NA,5),date2 = c("date12",NA,"date31"),
type3 = c(4,3,1), date3 = c("date13","date22","date32"),
type4 = c(4,4,NA),date4 = c("date14","date23",NA))
df
# obs type1 date1 type2 date2 type3 date3 type4 date4
# 1 1 2 date11 3 date12 4 date13 4 date14
# 2 2 2 date21 NA <NA> 3 date22 4 date23
# 3 3 NA <NA> 5 date31 1 date32 NA <NA>
newdf <- lapply(seq_len(nrow(df)), function(i) {    ## lapply always returns a list, as rbind.fill expects
  newrow <- df[i, !is.na(df[i, ])]                  ## Remove NAs
  names(newrow) <- names(df)[seq_along(newrow)]     ## Apply names
  newrow                                            ## Output
})
rbind.fill(newdf)
# obs type1 date1 type2 date2 type3 date3 type4 date4
# 1 1 2 date11 3 date12 4 date13 4 date14
# 2 2 2 date21 3 date22 4 date23 NA <NA>
# 3 3 5 date31 1 date32 NA <NA> NA <NA>
Caveat: this code only works when the type and date in each pair are either both observed or both NA.
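A small illustration of why that condition matters, using a made-up row in which type2 is observed but date2 is missing. Dropping the NA cell shifts every later value one column to the left, so values end up under the wrong names:

```r
# Hypothetical row: type2 is present, date2 is NA
row <- data.frame(obs = 1, type1 = 2, date1 = "date11",
                  type2 = 3, date2 = NA,
                  type3 = 4, date3 = "date13",
                  stringsAsFactors = FALSE)

kept <- row[1, !is.na(row[1, ])]            # drop NA cells, as in the answer
names(kept) <- names(row)[seq_along(kept)]  # reuse the leading column names
kept
# The value of type3 (4) now sits under date2, and "date13" sits under type3
```

Checking each pair for a lone NA before applying the shift would be one way to guard against this.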
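Answer 2 (score: 1)
Here is a solution using dplyr and tidyr. It converts the dataset to long format, reorders as needed, and then converts back to wide format. Converting to long format coerces the values to character, so the column types need to be reapplied afterwards.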
library(tidyr)
library(dplyr)
df <- read.table(header = TRUE, text = "
id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3 enddate3
1 A 2006-08-20 2006-12-06 W 2006-08-01 2007-08-29 P 2007-08-18 2007-09-27
2 A 2006-01-05 2007-07-02 NA NA NA Q 2008-01-15 2008-02-07
")
df %>%
  gather(key, value, -id) %>%                        # convert to long format
  extract(key, c("var", "seq"), "(.*)(\\d)") %>%     # extract sequence number
  spread(var, value) %>%                             # spread to wide format by id and sequence
  group_by(id) %>%
  arrange(startdate) %>%                             # sort seq by startdate in id groups
  mutate(seq = 1:n()) %>%                            # calculate new sequence order
  gather(key, value, -id, -seq) %>%                  # convert to long format
  transmute(var = paste0(key, seq), value) %>%       # generate wide format names
  spread(var, value) %>%                             # spread back to wide format
  select(one_of(names(df))) %>%                      # restore original column order
  mutate_each("as.Date", one_of(grep("date", names(df), value = TRUE)))
                                                     # reapply date type to original date variables
# Source: local data frame [2 x 10]
# Groups: id [2]
#
# id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3 enddate3
# (int) (chr) (date) (date) (chr) (date) (date) (chr) (date) (date)
# 1 1 W 2006-08-01 2007-08-29 A 2006-08-20 2006-12-06 P 2007-08-18 2007-09-27
# 2 2 A 2006-01-05 2007-07-02 Q 2008-01-15 2008-02-07 NA <NA> <NA>
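With newer package versions the same long-sort-wide idea can be written more compactly. The following is a sketch, not part of the original answer, and assumes dplyr 1.0 or later and tidyr 1.0 or later (for across(), pivot_longer(), and pivot_wider()). The special name ".value" in names_to keeps type, startdate, and enddate as separate columns, so nothing is coerced to character and the dates can be converted once up front. The result name sorted is arbitrary.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id type1 startdate1 enddate1 type2 startdate2 enddate2 type3 startdate3 enddate3
1 A 2006-08-20 2006-12-06 W 2006-08-01 2007-08-29 P 2007-08-18 2007-09-27
2 A 2006-01-05 2007-07-02 NA NA NA Q 2008-01-15 2008-02-07
")

sorted <- df %>%
  mutate(across(contains("date"), as.Date)) %>%     # dates as class Date
  pivot_longer(-id,
               names_to = c(".value", "seq"),       # one row per id and set
               names_pattern = "(.*)(\\d)") %>%
  group_by(id) %>%
  arrange(startdate, .by_group = TRUE) %>%          # NA start dates sort last
  mutate(seq = row_number()) %>%                    # new position within id
  ungroup() %>%
  pivot_wider(names_from = seq,
              values_from = c(type, startdate, enddate),
              names_sep = "") %>%                   # rebuild names such as type1
  select(all_of(names(df)))                         # restore original column order

sorted
```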