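I am trying to read in a large amount of data (up to 100 files, each no larger than 1.5 GB). The files are slightly annoying in that each one is a little different. For speed I want to use data.table::fread, but I have run into a number of problems: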
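My plan of attack is to read in all the column names, use a regular expression to find the columns I need, and then pass those columns to fread via select. But now I am struggling to assign colClasses, because these are applied before the columns are selected and before the names are checked, so even a named list does not work. Is there a way to apply check.names after colClasses / select without losing my leading zeros?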
I tried the named-column technique from "using colClasses in fread" and went through "Using colClasses and select arguments of fread simultaneously", but neither deals with the differences between my files.
Reproducible example:
dt <- data.frame(ID = c("01", "02", "03"), HH = 1:3, MM = rep(0, 3), HH = 2:4, MM = rep(0, 3), Precipx = rnorm(3),
                 other1 = rep(0, 3), other2 = rep(1, 3), check.names = FALSE)
write.csv(dt, "test.csv", row.names = FALSE, quote = FALSE)
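library(data.table)
# Read only the header; check.names = TRUE renames the duplicate columns to HH.1 and MM.1
Colnames <- names(fread("test.csv", nrows = 0, check.names = TRUE))
ColNos <- grep("ID|HH.1|MM.1|^Precip", Colnames)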
# This import works, but I lose the leading zeros
dat <- fread("test.csv", check.names = TRUE, select = ColNos)
# This tells me I have the wrong number of `colClasses`, but I cannot set them for all columns because the columns vary from file to file
dat <- fread("test.csv", check.names = TRUE, select = ColNos, colClasses = c("character", "character", "character", "numeric"))
# This doesn't recognise that I want the second HH column. Using just `"HH"` also has this problem,
# and "Precipx" will sometimes be "Precipy", "Precipz"... in the file
dat <- fread("test.csv", check.names = TRUE, select = ColNos,
             colClasses = c("ID" = "character", "HH.1" = "character", "MM.1" = "character", "Precipx" = "numeric"))
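
One direction that might get around the naming problem, as a rough sketch rather than a confirmed fix: give the types by column number instead of by name, using the list form of colClasses (type = column numbers) shown in the fread help examples. This assumes a reasonably recent data.table in which that form can be combined with select. The temporary file and the names path, hdr, keep and as_char are made up for illustration.

```r
library(data.table)

# Small stand-in for one of the real files: duplicate HH/MM headers, zero-padded ID.
path <- tempfile(fileext = ".csv")
writeLines(c("ID,HH,MM,HH,MM,Precipx,other1,other2",
             "01,1,0,2,0,0.5,0,1",
             "02,2,0,3,0,1.5,0,1"), path)

# Read only the header; check.names = TRUE renames the duplicates to HH.1 and MM.1.
hdr <- names(fread(path, nrows = 0, check.names = TRUE))

# Positions in the full file, found by regex on the checked names.
keep    <- grep("^ID$|^HH\\.1$|^MM\\.1$|^Precip", hdr)  # columns to import
as_char <- grep("^ID$|^HH\\.1$|^MM\\.1$", hdr)          # columns to keep as text

# Types are given by column number, so no names have to match inside fread.
dat <- fread(path, check.names = TRUE, select = keep,
             colClasses = list(character = as_char))
```

Because both select and colClasses refer to positions in the full file, the same code should cope with files where the precipitation column is called Precipy or Precipz, as long as the regular expressions still find the right headers.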