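I want to convert a data.frame from long format to wide format, using the values of "ITEM" as the column names and the values of "ITEM2" as the cell values (see below):
Long format:
Wide format:
To do this, I use the dcast function from the reshape2 package: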
    df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
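This works fine. However, my data frame has about 7 million records, and I am running into memory limits. I therefore decided to process the data in splits using ddply from the plyr package.
To make sure every split has the same columns in the same order, I extract the values of "ITEM" beforehand, append an NA column for each item that is missing from a split, and sort all columns alphabetically.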
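To illustrate only the padding and ordering step described above, here is a minimal base R sketch. The helper name `pad_columns` and its `id_cols` argument are hypothetical and not part of my actual code; the sketch assumes the split has already been cast to wide format and has at least one row.

```r
# Hypothetical helper: give every wide-format split the same columns in the same order.
# Assumes `df` is already wide and that `id_cols` are the identifier columns.
pad_columns <- function(df, item, id_cols = c("SEQUENCEID", "EVENTID")) {
  # Add an NA column for every item that does not occur in this split
  for (col in setdiff(item, colnames(df))) {
    df[[col]] <- NA_character_
  }
  # Identifier columns first, then the item columns in alphabetical order
  item_cols <- sort(setdiff(colnames(df), id_cols))
  df[, c(id_cols, item_cols), drop = FALSE]
}

# Small usage example with made-up wide data
wide <- data.frame(SEQUENCEID = 1546842, EVENTID = 5468503147,
                   juice = "juice", coffee = "coffee")
padded <- pad_columns(wide, item = c("cakes", "coffee", "juice"))
colnames(padded)
```

Using `drop = FALSE` keeps the result a data frame even when a split contains only a single item column.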
The full code is below:
    library(reshape2)  # dcast
    library(plyr)      # ddply

    # Example data
    lo_raw <- data.frame(SEQUENCEID = rep(1546842, 10),
                         EVENTID = c(5468503146, 5468503146, 5468503146, 5468503147, 5468503147,
                                     5468503148, 5468503148, 5468503148, 5468503148, 5468503148),
                         ITEM  = c("cakes", "limonade", "coffee", "coffee", "juice",
                                   "limonade", "cakes", "water", "fruits", "vegetable"),
                         ITEM2 = c("cakes", "limonade", "coffee", "coffee", "juice",
                                   "limonade", "cakes", "water", "fruits", "vegetable"),
                         SPLIT = rep(1547000, 10))

    # Extract items
    item <- as.character(unique(lo_raw$ITEM))

    # Function dcast: cast one split and pad it to the full set of item columns
    castff <- function(df, item) {
        df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")
        # Add an NA column for every item missing from this split
        for (i in item) {
            if (!(i %in% colnames(df))) {
                df[, i] <- NA
            }
        }
        # Keep the two ID columns first, then sort the item columns alphabetically
        df <- df[, c(1, 2, order(colnames(df)[3:ncol(df)]) + 2)]
        df
    }

    # Apply dcast to each SPLIT group
    df_pivot <- ddply(lo_raw, .(SPLIT), .fun = castff, item = item,
                      .progress = "text", .inform = TRUE)
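While ddply is running, RAM usage keeps growing until it reaches the maximum (12 GB). Performance then becomes very slow, and I killed R after a few hours.
Is there another way to cast the whole dataset?
Thanks in advance.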