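I am working in R with a long table stored as a data.table. It contains values recorded on value changes, for variables of both numeric and character type. When I want to run functions such as correlation or regression, I have to convert the table to wide format and homogenize the timestamp frequency.
I found a way to convert the long table to a wide one, but I do not think it is efficient, and I would like to know whether there is a better data.table-native approach.
In the reproducible example below I include the two options I found for the long-to-wide transformation; in the comments I point out the parts that I believe are not optimal.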
library(zoo)
library(data.table)
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF; I do not know whether I could benefit from
# data.table's roll option somehow and avoid
# the temporary in-memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
casted2 <- dcast(dt[, coalesce2(numeric, character), by = c("time", "variable")],
                 time ~ variable, value.var = "V1")
# There are no all-NA columns, but the types are incorrect;
# it takes more space in a real table (more observations, fewer variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars <- names(casted2)[!unlist(casted2[, lapply(
  lapply(lapply(.SD, as.numeric), is.na), all)])]
casted2[, (numericvars) := lapply(.SD, as.numeric), .SDcols = numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
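Any suggestions or improvements to the process are welcome.
Answer 0 (score: 2)
I would build the character and numeric tables separately and then rbind them: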
k = "time"
typecols = c("numeric", "character")
# cast each value type on its own, keeping only the rows where that type is populated
res = rbindlist(fill = TRUE,
  lapply(typecols, function(tc) {
    cols = c(k, tc, "variable")
    dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var = tc)]
  })
)
setorderv(res, k)
# LOCF on every non-key column, assigned by reference
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols = !k]
which gives
   time a  b  c  d  e  f
1:    1 1 NA NA NA NA NA
2:    2 1  2 NA NA NA NA
3:    3 1  2  3 NA NA NA
4:    4 1  2  3  a NA NA
5:    5 1  2  3  a  b NA
6:    6 1  2  3  a  b  c
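If the zoo dependency in the LOCF step is unwanted, the same fill can probably be done with a small base-R index trick and set(), which updates columns by reference. This is only a sketch: the helper name locf and the toy table below are made up for illustration and are not part of data.table or zoo.

```r
library(data.table)

# hypothetical helper: LOCF for any atomic vector type; leading NAs stay NA
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the last non-NA seen so far
  idx[idx == 0L] <- NA                        # nothing observed yet
  x[idx]
}

# toy table shaped like `res` before the LOCF step (illustrative data only)
res <- data.table(time = 1:4,
                  a = c(1L, NA, NA, NA),
                  d = c(NA, NA, "x", NA))

# fill every non-key column by reference, one column at a time
for (col in setdiff(names(res), "time")) {
  set(res, j = col, value = locf(res[[col]]))
}
print(res)
```

Because each column is processed on its own, the integer and character columns keep their types.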
Note that the OP's final result, casted2, differs in that it turns all columns into character.
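On the question of whether data.table's roll option can help: one possible approach is to skip dcast and LOCF altogether and roll each variable's observations onto a regular time grid with a rolling join. The following is a rough sketch under assumptions, not a benchmarked solution: grid and roll_one are hypothetical names, and the grid 1:6 is simply chosen to match the example data.

```r
library(data.table)

dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")

grid <- data.table(time = 1:6)  # assumed regular timestamps to align to

# hypothetical helper: roll one variable's last observation forward onto the grid
roll_one <- function(v) {
  obs <- dt[variable == v]
  # use whichever value column is populated for this variable
  col <- if (all(is.na(obs$numeric))) "character" else "numeric"
  obs <- obs[!is.na(get(col)), c("time", col), with = FALSE]
  # rolling join: each grid time takes the most recent observation at or before it
  obs[grid, on = "time", roll = TRUE][[col]]
}

vars <- levels(dt$variable)
wide <- copy(grid)
wide[, (vars) := lapply(vars, roll_one)]
print(wide)
```

Each variable is handled separately, so the column types are preserved, and the grid can be replaced by any other set of timestamps to change the output frequency.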