Question

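I have a large tab-delimited .csv file with a strict structure: colClasses = c("integer", "integer", "numeric"). For some reason it contains a number of junk lines of irrelevant characters that break this pattern, which is why I get: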

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'an integer', got 'ExecutiveProducers'

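How can I make read.table continue and skip such a line? The file is large, so doing this by hand is tedious. If that is not possible, should I use scan plus a for-loop?

For now I read everything as character, then delete the irrelevant rows and convert the columns back to numeric, which I don't think is very memory-efficient.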

Answer 1

If your file fits in memory, you can read it in first, remove the unwanted lines, and then parse the remaining lines with read.csv:

lines <- readLines("yourfile")

# Remove unwanted lines: keep only lines that contain no alphabetic
# characters. Assuming the first line holds the column titles, add it
# back again; hence the c(1, sel).
sel <- grep("[[:alpha:]]", lines, invert = TRUE)
lines <- lines[c(1, sel)]

# Read the data from the selected lines; sep and colClasses are taken
# from the question, adjust the remaining arguments as usual.
con <- textConnection(lines)
data <- read.csv(file = con, sep = "\t",
                 colClasses = c("integer", "integer", "numeric"))
close(con)
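
As a self-contained illustration of the same idea, here is a minimal sketch that can be run as-is. The sample rows, the column names a, b, c and the temporary file are made up for the example; only the junk string and the colClasses come from the question. It searches for junk only below the header line, so the header is kept exactly once.

```r
# Hypothetical sample data: a tab-delimited file with one junk line.
path <- tempfile(fileext = ".csv")
writeLines(c("a\tb\tc",
             "1\t2\t3.5",
             "ExecutiveProducers",
             "4\t5\t6.25"), path)

lines <- readLines(path)

# Keep the header (line 1) plus every later line that contains no
# alphabetic characters; + 1 maps indices in lines[-1] back to lines.
keep <- c(1, grep("[[:alpha:]]", lines[-1], invert = TRUE) + 1)

# Parse only the kept lines, enforcing the strict column types.
data <- read.table(text = lines[keep], header = TRUE, sep = "\t",
                   colClasses = c("integer", "integer", "numeric"))
str(data)
```

Note that the pattern "[[:alpha:]]" would also drop valid rows written in scientific notation (for example 1e5), so a stricter pattern may be needed for such data.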

Answer 2

If the strings are always the same, or always contain the same word, you can define them as NA values with

  read.csv(...,  na.strings="")

and then delete all of them with

na.omit(dataframe)
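
A minimal runnable sketch of this approach, assuming the junk line always consists of exactly the string from the error message in the question. The sample rows and the column names a, b, c are hypothetical. Since na.strings is compared against whole fields, a junk line that merely contains the word among other text would still raise the scan() error.

```r
# Hypothetical sample data: a tab-delimited file with one junk line.
path <- tempfile(fileext = ".csv")
writeLines(c("a\tb\tc",
             "1\t2\t3.5",
             "ExecutiveProducers",
             "4\t5\t6.25"), path)

# Treat the junk string as NA; read.csv pads the short line with NA
# because fill = TRUE is its default.
raw <- read.csv(path, sep = "\t",
                na.strings = "ExecutiveProducers",
                colClasses = c("integer", "integer", "numeric"))

# Drop every row that contains an NA.
clean <- na.omit(raw)
str(clean)
```

Keep in mind that na.omit also removes rows with genuinely missing values, so this only suits data where NA never occurs legitimately.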

read.table: skipping lines with errors
