Reshaping, melting and casting on large data frames in R

Time: 2016-08-11 15:22:21

Tags: r memory dataframe casting melt

I have a bunch of data frames that I want to modify using the packages tidyr and reshape / reshape2. Each one looks like this:

Y      C        S      A    B_B_m  B_B_p  C_m  C_p  D_m  D_p 
2000 "AUSTRIA" "total" "no"  33      44   55   66   77   99
2001 "AUSTRIA" "total" "no"  22      11   0    23   24   25
2002 "AUSTRIA" "total" "no"  88      45   56   47   38   39
2003 "AUSTRIA" "total" "no"  90      48   67   67   69   74

should become

       "C"    "Y"    "S"    "A"      "moment" "B_B" "C"  "D"
    "AUSTRIA" 2000 "total" "no"        "m"     33    55  77
    "AUSTRIA" 2000 "total" "no"        "p"     44    66  99
    "AUSTRIA" 2001 "total" "no"        "m"     22    0   24
    "AUSTRIA" 2001 "total" "no"        "p"     11    23  25
    "AUSTRIA" 2002 "total" "no"        "m"     88    56  38
    "AUSTRIA" 2002 "total" "no"        "p"     45    47  39
    "AUSTRIA" 2003 "total" "no"        "m"     90    67  69
    "AUSTRIA" 2003 "total" "no"        "p"     48    67  74

I use the following code to do this:

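library(readstata13)  # read.dta13()
library(reshape)      # melt(), cast()
library(tidyr)        # extract()

setwd("C:\\...")
files <- list.files(pattern = "\\.dta$")  # list the Stata files in the folder
dflist <- list()
for (i in seq_along(files)) {
  dflist[[i]] <- read.dta13(files[i], nonint.factors = TRUE)
  # long format: one row per id combination and measured variable
  dflist[[i]] <- melt(dflist[[i]], id = c("C", "Y", "S", "A"))
  # split e.g. "B_B_m" into type "B_B" and moment "m"
  dflist[[i]] <- extract(dflist[[i]], variable, c('type', 'moment'), '^(.+)_([^_]+)$')
  # back to wide format: one column per type
  dflist[[i]] <- cast(dflist[[i]], ... ~ type)
}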
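As a minimal sketch of what the regular expression in the extract() call does, here it is applied to the column names from the example table with base R's sub(). The vector names `vars`, `type` and `moment` are only for illustration.

```r
# Column names taken from the example table in the question.
vars <- c("B_B_m", "B_B_p", "C_m", "C_p", "D_m", "D_p")
pattern <- "^(.+)_([^_]+)$"

# Group 1 is greedy, so it keeps everything up to the last underscore;
# group 2 is the part after the last underscore.
type   <- sub(pattern, "\\1", vars)
moment <- sub(pattern, "\\2", vars)

data.frame(variable = vars, type = type, moment = moment)
```

So "B_B_m" is split at its last underscore only, which is why the type keeps its inner underscore.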

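This code works, but not for large data frames. Some of my data frames have hundreds (if not thousands) of variables, and with this code I keep running out of memory or R simply crashes. Any ideas?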

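Edit:

Someone commented about the ff package but then deleted their comment. I looked into the package anyway, but I can't seem to read my data frames into R ...

I tried: ffdfbig <- read.csv.ffdf(file="dfbig.csv") but this gave me the error: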

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '"1001"'

I also tried using the colClasses argument:

sampleData <- read.csv("dfbig.csv", header = TRUE, nrows = 5)
classes <- sapply(sampleData, class)
ffdfbig <- read.csv.ffdf(file="dfbig.csv", header = TRUE, colClasses=classes)

and got the same error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '"1"'

:(

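1 answer:

Answer 0: (score: 1)

If your data sets are large, you can try the ff package. Here you can find some examples of how to use it.

Another option is the data.table package; here you can find a basic tutorial.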
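For the data.table route, a minimal sketch of the same reshape on the sample data from the question could look like the following. It is not taken from the tutorial; the object names and the ".1" suffix are my own choices. The suffix is only there because the measured type "C" would otherwise clash with the identifier column C. Whether this fits into memory for the real files is something you would have to test.

```r
library(data.table)

# Sample data from the question.
dt <- data.table(
  Y = 2000:2003, C = "AUSTRIA", S = "total", A = "no",
  B_B_m = c(33, 22, 88, 90), B_B_p = c(44, 11, 45, 48),
  C_m   = c(55, 0, 56, 67),  C_p   = c(66, 23, 47, 67),
  D_m   = c(77, 24, 38, 69), D_p   = c(99, 25, 39, 74)
)
ids <- c("C", "Y", "S", "A")

# Long format: one row per id combination and measured variable.
long <- melt(dt, id.vars = ids, variable.factor = FALSE)

# Split the variable name at its last underscore.
long[, c("type", "moment") := list(sub("_[^_]+$", "", variable),
                                   sub("^.*_", "", variable))]
long[, variable := NULL]

# Avoid a duplicate column name when a type equals an id column name.
long[type %in% ids, type := paste0(type, ".1")]

# Wide format again: one column per type, one row per id and moment.
wide <- dcast(long, C + Y + S + A + moment ~ type, value.var = "value")
wide
```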

EDITED

OK, here is what I have so far. Suppose your .csv file contains the sample data you provided:

Y,C,S,A,B_B_m,B_B_p,C_m,C_p,D_m,D_p
2000,"AUSTRIA","total","no",33,44,55,66,77,99
2001,"AUSTRIA","total","no",22,11,0,23,24,25
2002,"AUSTRIA","total","no",88,45,56,47,38,39
2003,"AUSTRIA","total","no",90,48,67,67,69,74

You can read the file with the ff package:

library(ff)
library(ffbase)
library(reshape2)
library(tidyr)  # provides extract(), used further down

ffdfbig <- read.csv.ffdf(file="/path/to/your/file/dataFile.csv", 
                         colClasses=c("numeric", rep("factor", 3), rep("numeric", 6)), 
                         header = TRUE)
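On why explicit classes can matter: the message got '"1001"' suggests the parser met a quoted number in a column it expected to be an integer. The sketch below sets up that situation with base read.csv() instead of ff. The two-column file content is made up for illustration, and it is only an assumption that the real files fail for this reason.

```r
# Hypothetical file content: the first column holds quoted numbers.
csv_text <- 'id,value\n"1001",3\n"1002",4\n'

# Asking for an integer column directly. With quoted values this may stop with
# "scan() expected 'an integer'", so the outcome is captured, not assumed.
attempt <- tryCatch(
  read.csv(text = csv_text, colClasses = c("integer", "integer")),
  error = function(e) conditionMessage(e)
)

# Reading the quoted column as character and converting afterwards avoids
# handing quoted text to the integer parser.
df <- read.csv(text = csv_text, colClasses = c("character", "integer"))
df$id <- as.integer(df$id)
str(df)
```

With ff the same idea would use "factor" or "numeric" for such a column, as in the read.csv.ffdf() call above.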

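You said you had trouble reading the integers (when their type was inferred from the .csv). I was able to read the file and create an ffdf object by passing the column classes explicitly. Once you have the ffdf object, you can do the first part of the reshaping like this: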

res <- ffdfdply(x=ffdfbig, split=ffdfbig$Y, FUN=function(x){
  df <- reshape(x, 
                v.names = "value", 
                varying = c("B_B_m", "B_B_p", "C_m", "C_p", "D_m", "D_p"),
                timevar = "variable",
                times = c("B_B_m", "B_B_p", "C_m", "C_p", "D_m", "D_p"),
                direction = "long")
  as.data.frame(df)
})

I didn't know how to apply a function to an ffdf object, but this answer gave me the key.

The result of the code above is:

"Y","C","S","A","variable","value","id"
2000,"AUSTRIA","total","no","B_B_m",33,1
2001,"AUSTRIA","total","no","B_B_m",22,2
2002,"AUSTRIA","total","no","B_B_m",88,3
2003,"AUSTRIA","total","no","B_B_m",90,4
2000,"AUSTRIA","total","no","B_B_p",44,1
2001,"AUSTRIA","total","no","B_B_p",11,2
2002,"AUSTRIA","total","no","B_B_p",45,3
2003,"AUSTRIA","total","no","B_B_p",48,4
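To see what the function passed to ffdfdply() does to a single chunk, the same reshape() call can be tried on an ordinary data.frame. This is only a sketch on the sample data. Deriving `measure` from the column names is my addition, so that hundreds of variable names would not have to be typed out.

```r
# Sample data from the question as an ordinary data.frame.
df <- data.frame(
  Y = 2000:2003, C = "AUSTRIA", S = "total", A = "no",
  B_B_m = c(33, 22, 88, 90), B_B_p = c(44, 11, 45, 48),
  C_m   = c(55, 0, 56, 67),  C_p   = c(66, 23, 47, 67),
  D_m   = c(77, 24, 38, 69), D_p   = c(99, 25, 39, 74)
)

# Every column that is not an identifier is treated as a measurement column.
measure <- setdiff(names(df), c("Y", "C", "S", "A"))

# Stack all measurement columns into one "value" column; "variable" records
# which original column each value came from.
long <- reshape(df,
                v.names   = "value",
                varying   = measure,
                timevar   = "variable",
                times     = measure,
                direction = "long")

head(long, 4)
```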

Finally, the steps for "splitting" off the "m"s and "p"s and for casting:

res <- ffdfdply(x=res, split = res$Y, FUN = function(y){
  # split e.g. "B_B_m" into type "B_B" and moment "m" (extract() is from tidyr)
  extract(y, variable, c('type', 'moment'), '^(.+)_([^_]+)$')
})

res <- ffdfdply(x = res, split = res$Y, FUN = function(x){
  df <- dcast(x, ...~type)
})

res$id <- NULL
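The split and cast steps can also be checked on an ordinary data.frame before running them through ffdfdply(). The sketch below uses a small hand-made long table (two years, types B_B and D only) and assumes tidyr and reshape2 are installed. The names `long`, `parts` and `wide` are illustrative.

```r
library(tidyr)     # extract()
library(reshape2)  # dcast()

# A small long-format frame shaped like the intermediate result above.
long <- data.frame(
  Y = rep(2000:2001, times = 4),
  C = "AUSTRIA", S = "total", A = "no",
  variable = rep(c("B_B_m", "B_B_p", "D_m", "D_p"), each = 2),
  value = c(33, 22, 44, 11, 77, 24, 99, 25),
  stringsAsFactors = FALSE
)

# Split "B_B_m" into type = "B_B" and moment = "m".
parts <- extract(long, variable, c("type", "moment"), "^(.+)_([^_]+)$")

# One column per type; all remaining columns identify a row.
wide <- dcast(parts, ... ~ type, value.var = "value")
wide
```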

If you want to write the result back to a .csv, you can use this function:

write.csv.ffdf(res, "final.csv")

which produces the following csv:

"","Y","C","S","A","moment","B_B","C.1","D"
"1",2000,"AUSTRIA","total","no","m",33,55,77
"2",2000,"AUSTRIA","total","no","p",44,66,99
"3",2001,"AUSTRIA","total","no","m",22,0,24
"4",2001,"AUSTRIA","total","no","p",11,23,25
"5",2002,"AUSTRIA","total","no","m",88,56,38
"6",2002,"AUSTRIA","total","no","p",45,47,39
"7",2003,"AUSTRIA","total","no","m",90,67,69
"8",2003,"AUSTRIA","total","no","p",48,67,74

You can try these functions on all of your big csv files and see whether they still run into memory errors. I hope this helps.