Question

I have a file with 22268 rows by 2521 columns. When I try to read the file in with the following line of code:

file <- read.table(textfile, skip=2, header=TRUE, sep="\t", fill=TRUE, blank.lines.skip=FALSE)

I only get 13024 rows by 2521 columns read in, along with the following warning:

Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : number of items read is not a multiple of the number of columns

I also used this command to see which rows have an incorrect number of columns:

x <- count.fields(textfile, sep="\t", skip=2)
incorrect <- which(x != 2521)

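and got back a list of about 20 incorrect rows.

Is there a way to fill these rows with NA values?

I thought that was what the "fill" argument of read.table does, but it doesn't appear to work that way here.

OR

Is there a way to ignore the rows identified in the "incorrect" variable?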

Answer 1

You can use readLines() to read in the data, then look for the problematic lines.

    con <- file("path/to/file.csv", "rb")
    rawContent <- readLines(con)  # one character string per line of the file
    close(con)  # close the connection to the file, to keep things tidy

Then take a look at rawContent.

To find the lines that have an incorrect number of columns, for example:

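    expectedColumns <- 2521
    delim <- "\t"

    indxToOffenders <-
    sapply(rawContent, function(x) {   # for each line in rawContent
        hits <- gregexpr(delim, x, fixed=TRUE)[[1]]   # positions of every delim; -1 if the line has none
        sum(hits > 0) != expectedColumns - 1   # a line with N columns contains N - 1 delims
    }, USE.NAMES=FALSE)   # logical vector: TRUE marks an offending line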

Then read in your data:

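    # indxToOffenders is logical, so negate it with ! to keep the good lines;
    # pass the lines via text= (the first argument would be treated as a file name)
    myDataFrame <- read.csv(text=rawContent[!indxToOffenders], header=TRUE, sep=delim)  # header=TRUE as in the question; adjust for your file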
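
To see both options from the question side by side (dropping the bad rows, or padding them with NA), here is a minimal sketch on a tiny made-up file. The file contents and the names `path`, `nFields`, `keep`, `cleaned`, `padded` and `filled` are assumptions for illustration only. It also assumes the fields contain no quoted tabs, and `lengths()` and `strrep()` need a reasonably recent R (3.3.0 or later).

```r
# Hypothetical 3-column tab-delimited file with one short row
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc",
             "1\t2\t3",
             "4\t5",       # offender: only two fields
             "6\t7\t8"), path)

expectedColumns <- 3
rawContent <- readLines(path)

# Fields per line = number of tabs + 1 (lines without any tab count as 1 field)
tabHits <- regmatches(rawContent, gregexpr("\t", rawContent, fixed = TRUE))
nFields <- lengths(tabHits) + 1

# Option 1: ignore the lines with the wrong number of fields
keep <- nFields == expectedColumns
cleaned <- read.table(text = rawContent[keep], header = TRUE, sep = "\t")

# Option 2: pad short lines with trailing tabs so the missing fields read as NA
# (lines with too many fields are left untouched and would still need attention)
missingFields <- pmax(expectedColumns - nFields, 0)
padded <- paste0(rawContent, strrep("\t", missingFields))
filled <- read.table(text = padded, header = TRUE, sep = "\t")
```

In this sketch `cleaned` keeps only the two complete data rows, while `filled` keeps all three and has NA in column `c` for the short row. On the real file you would set `expectedColumns` to 2521 and drop the two leading lines that the question skips with `skip=2` before counting.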
