Can fread apply select and colClasses after check.names?

Time: 2018-05-31 01:34:01

Tags: r data.table fread

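I am trying to read a large amount of data (up to 100 files, each no larger than 1.5 GB). The files are slightly annoying in that each one is formatted a little differently. For speed I would like to use data.table::fread, but I have run into several problems: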

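  • The input data (whose format I cannot control) has many columns with identical names.
  • The ID column (and possibly others) looks numeric but is really a character or factor column. I need to keep the leading 0s, so I cannot convert it after import.
  • I only want some of the columns, and with this much data I do not want to import everything and then drop columns.
  • The number and names of the columns differ slightly from file to file. The columns I want are easy to find with a regular expression, and I always end up with the same number of them.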

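My plan of attack is to read the header of every column, use a regular expression to find the columns I want, and then call fread with those columns in select. But now I am having trouble assigning colClasses, because they are assigned before the columns are selected and before the names are checked, so even a named list does not work. Is there a way to have colClasses / select applied after check.names, so that I do not lose my leading zeros?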

I tried the named-column technique from "using colClasses in fread" and looked at "Using colClasses and select arguments of fread simultaneously", but neither deals with the differences between my files.

Reproducible example:

library(data.table)

# Example file with duplicated column names (HH and MM appear twice)
dt <- data.frame(ID = c("01","02","03"), HH = 1:3, MM = rep(0,3), HH = 2:4, MM = rep(0,3), Precipx = rnorm(3),
             other1 = rep(0,3), other2 = rep(1,3), check.names = FALSE)
write.csv(dt, "test.csv", row.names = FALSE, quote = FALSE)

# Read only the header, then find the wanted columns by regex
Colnames <- names(fread("test.csv", nrows = 0, check.names = TRUE))
ColNos <- grep("ID|HH.1|MM.1|^Precip", Colnames)
# This import works, but I lose the leading 0s
dat <- fread("test.csv", check.names = TRUE, select = ColNos)

# This tells me I have the wrong number of `colClasses`, but I cannot set them for all columns as they vary from file to file
dat <- fread("test.csv", check.names = TRUE, select = ColNos, colClasses = c("character","character","character","numeric"))

# This doesn't recognise that I want the second HH column. Using just `"HH"` also has this problem,
# and "Precipx" will sometimes be "Precipy", "Precipz"... in the file
dat <- fread("test.csv", check.names = TRUE, select = ColNos,
  colClasses = c("ID" = "character","HH.1" = "character","MM.1" = "character","Precipx" = "numeric"))
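
One possible workaround, given only as a sketch: fread's colClasses is documented to accept a named list of vectors of column names or numbers, so the positions found from the header can be passed to both select and colClasses, avoiding the duplicated names altogether. The temporary file, the inline sample rows, and the names `hdr`, `char_cols`, `num_cols` and `cols` are assumptions made for illustration. This has not been checked against the real files, and behaviour may differ between data.table versions.

```r
library(data.table)

# Hypothetical stand-in for one input file, with the same duplicated
# column names as the example above.
path <- tempfile(fileext = ".csv")
writeLines(c("ID,HH,MM,HH,MM,Precipx,other1,other2",
             "01,1,0,2,0,0.5,0,1",
             "02,2,0,3,0,-1.2,0,1",
             "03,3,0,4,0,0.3,0,1"), path)

# Step 1: read only the header and de-duplicate the names.
hdr <- names(fread(path, nrows = 0, check.names = TRUE))

# Step 2: find the wanted columns by their position in the file.
char_cols <- grep("^(ID|HH\\.1|MM\\.1)$", hdr)  # keep as character
num_cols  <- grep("^Precip", hdr)               # Precipx, Precipy, ...
cols <- c(char_cols, num_cols)

# Step 3: pass positions (not names) to both select and colClasses.
dat <- fread(path, select = cols,
             colClasses = list(character = char_cols, numeric = num_cols))

# Step 4: assign the de-duplicated names explicitly, so the result does
# not depend on when fread applies check.names.
setnames(dat, hdr[cols])
```

Since only positions are passed, the same steps could be wrapped in a function and applied to each file in turn, as long as the regular expressions match the intended columns in every file.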

0 answers:

No answers yet