Question

I am trying to read large (> 70 MB) fixed-format text files into R. For smaller files (< 1 MB), I can use the read.fwf() function as shown below.

condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

When I try to run the line of code below,

condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

I get the following error message:

Error: cannot allocate vector of size 2 Kb

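The only difference between the two lines is the size of the input file.

The format of the file I want to import is given in a data frame called testcsv3. A small excerpt of that data frame is shown below: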

> head(testcsv3)

  Varlen      Varname    Varclass Varsep Varforfmt
1      2         "V1" "character"      2    "A2.0"
2     15         "V2" "character"     17   "A15.0"
3     28         "V3" "character"     45   "A28.0"
4      3         "V4" "character"     48    "F3.0"
5      1         "V5" "character"     49    "A1.0"
6      3         "V6" "character"     52    "A3.0"
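
In the rows shown, Varsep equals the running total of Varlen, so it appears to mark the last column of each field rather than the first. A quick check, with the six values copied by hand from the excerpt above (the vector names are only for illustration):

```r
# Widths and Varsep values copied from the head(testcsv3) output above
varlen <- c(2, 15, 28, 3, 1, 3)
varsep <- c(2, 17, 45, 48, 49, 52)

# End column of each field is the running total of the widths
end_col <- cumsum(varlen)
# Start column is one past the end of the previous field
start_col <- end_col - varlen + 1

# Confirms that Varsep matches the end columns for these six rows
stopifnot(all(end_col == varsep))
data.frame(start_col, end_col, varlen)
```

This only covers the six rows in the excerpt; the rest of testcsv3 is not shown, so whether the pattern holds throughout is an assumption.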

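At least part of my problem is that read.fwf() reads all of the data in as factors, and I end up exceeding my computer's memory limit.

I tried to use read.table() as a way of setting the type of each variable, but it seems that function needs a text delimiter. Section 3.3 of the link below suggests that I could use sep to identify the column where each variable starts.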

http://data.princeton.edu/R/readingData.html

However, when I use the following command:

condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)

I get the following error message:

Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname,  : invalid 'sep' argument

Finally, I tried to use:

condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)

but I get the following message:

Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion

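At this point all I want to do is read the data in as something other than factors. I hope this will limit the amount of memory I use and allow me to actually read in the file. I would appreciate any suggestions on how to do this. I know the Fortran format of every variable and the column where each variable starts.

Thanks,

Warren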

Answer 1

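Maybe this code will work for you. You have to fill varlen with the field widths and put the corresponding type strings (e.g. "numeric", "character", "integer") in colclasses.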

my.readfwf <- function(filename, varlen, colclasses) {
  # Start and end column of every field, derived from the field widths
  sidx <- cumsum(c(1, head(varlen, -1)))
  eidx <- sidx + varlen - 1
  # Read each line as one string; readLines() does no quote or separator handling
  filecontent <- readLines(filename)
  if (length(unique(nchar(filecontent))) > 1)
    stop("line lengths differ!")
  res <- vector("list", length(varlen))
  for (i in seq_along(varlen)) {
    # substring() is vectorised over the lines, so no sapply() is needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    # Convert the extracted text to the requested type
    mode(res[[i]]) <- colclasses[i]
  }
  names(res) <- paste0("V", seq_along(res))
  as.data.frame(res, stringsAsFactors = FALSE)
}
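
If the main concern is columns arriving as factors, another option is to keep read.fwf() and tell it the column types. read.fwf() forwards extra arguments to read.table(), so colClasses can be passed through. Below is a minimal sketch on a throwaway file. The file contents, widths, and column names are made up for illustration, and it has not been tried on a 70 MB file:

```r
# Write a tiny fixed-width file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB  12x", "CD 345y"), tmp)

# Hypothetical layout: 2-character code, 4-character number, 1-character flag
widths  <- c(2, 4, 1)
classes <- c("character", "integer", "character")

dat <- read.fwf(tmp, widths = widths,
                col.names = c("code", "amount", "flag"),
                colClasses = classes)   # passed on to read.table()

str(dat)
unlink(tmp)
```

With the layout from the question, the same call would presumably use widths = testcsv3$Varlen and colClasses = testcsv3$Varclass, provided those columns hold plain type names without embedded quote characters.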
