Question

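I am trying to read a large dataset (3.7 million rows, 180 columns) into R using the ff package. The dataset contains several data types: factor, logical, and numeric.

The problem is with reading the numeric variables. For example, one of my columns is: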

TotalBeforeTax
126.9
88.0
124.5
90.9
...

When I try to read the data, the following error is thrown:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"126.90000"'

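I tried declaring the class as integer using the colClasses argument (it was already declared as numeric), but to no avail. I also tried changing it to a real (whatever that is supposed to mean); it starts reading the data, but at some point throws: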

Error in methods::as(data[[i]], colClasses[i]) : 
  no method or default for coercing “character” to “a real”

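(My guess is that this happens because it encounters an NA and does not know how to handle it.)

Interestingly, if I declare the column as factor, everything reads in fine.

What gives?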

Answer 1

OK, so I managed to solve this with a crude workaround. First, split the .csv file using a CSV file splitter application. Then run the following code:

## First, set the folder where the split .csv files are. Set the file names.

sourceDir <- "split_files_folder"
sourceFile <- paste(sourceDir,"common_name_of_split_files", sep = "/")

## Now set the number of split pieces (replace the placeholder with an integer).

pieces <- "some_number"

## Set the destination folder for the tab-delimited text files. 
## Set the output file name.

destDir <- "destination_folder"
destFile <- paste(paste(destDir, "datafile", sep = "/"), "txt", sep = ".")

## Now, initialize the loop.

for (i in 1:pieces)
{
  temp <- read.csv(file = paste(paste(sourceFile, i, sep = "_"), "csv", sep = "."))
  if (i == 1) 
  {
    write.table(temp, file = destFile, quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
  }
  else 
  {
    write.table(temp, file = destFile, append = TRUE, quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
  }
}

Voilà! You have one huge tab-delimited text file!

Answer 2

Solution 1

You can try laf_to_ffdf from the ffbase package. Something like:

library(LaF)
library(ffbase)

con <- laf_open_csv("yourcsvfile.csv", 
  column_names = [a character vector with column names], 
  column_types = [a character vector with colClasses], 
  dec=".", sep=",", skip=1)

ffdf <- laf_to_ffdf(con)

Or, if you want the types to be detected automatically:

library(LaF)
library(ffbase)

m <- detect_dm_csv("yourcsvfile.csv")
con <- laf_open(m)
ffdf <- laf_to_ffdf(con)

Solution 2

Use character as the column class for the offending column, and convert the column to numeric in the transFUN argument of read.csv.ffdf:

ffdf <- read.csv.ffdf([your regular arguments], transFUN = function(d) {
  d$offendingcolumn <- as.numeric(d$offendingcolumn)
  d
})
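
Since transFUN is called on each chunk as an ordinary data frame, the conversion function can be tried on a small in-memory data frame before running the full import. A minimal sketch, where to_numeric_column and the sample values are made up for illustration and offendingcolumn is the placeholder name used above:

```r
# Hypothetical helper with the same body as the transFUN above.
to_numeric_column <- function(d) {
  d$offendingcolumn <- as.numeric(d$offendingcolumn)
  d
}

# Made-up sample chunk: the column arrives as character and contains one NA.
chunk <- data.frame(
  offendingcolumn = c("126.90000", "88.00000", NA),
  stringsAsFactors = FALSE
)

# Apply the conversion and inspect the resulting column type.
fixed <- to_numeric_column(chunk)
str(fixed)
```

If the column comes out as numeric here, with the NA preserved, the same function should be reasonable to pass as transFUN.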

Answer 3

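The problem seems to be the quotation marks (") surrounding the number 126.9000. So perhaps you should first read the variable as character, then remove all unwanted characters, and finally convert the variable to numeric.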
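
The three steps can be sketched in base R as follows. The values in raw are made up to resemble the TotalBeforeTax column from the question, and the variable names are only illustrative:

```r
# Made-up raw values, as they might look when the column is held as text
# with the quotation marks still attached.
raw <- c('"126.90000"', '"88.00000"', '"124.50000"', NA)

# Remove the literal quote characters; NA entries stay NA.
stripped <- gsub('"', "", raw, fixed = TRUE)

# Convert the cleaned strings to numeric.
total_before_tax <- as.numeric(stripped)
print(total_before_tax)
```

Using fixed = TRUE treats the quote as a literal character and not as a regular expression, which keeps the cleanup step simple.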
