Is there a way to automatically choose the correct read.csv parameters?

Time: 2015-03-28 14:32:32

Tags: r csv

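I have several CSV files whose formats differ slightly: an extra column may be present (which I don't need), the header row may or may not be there, and the time format can be either %Y-%m-%d %H:%M:%S or %Y%m%d%H%M%S. Is there a way to pre-analyze a CSV file and then choose the correct parameters for read.csv?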

For example, I use the following logic to read the files:

# file 1
dataft <- read.csv("file1.csv", header = TRUE, colClasses = c("factor", "factor", "factor", "factor", "integer", "factor", "integer", "factor", "factor", "factor", "integer", "factor", "factor"))
dataft[,"ddate"] = as.Date(dataft[,"ddate"],"%Y-%m-%d")

# file 2
datagae <- read.csv("file2.csv", header = FALSE, colClasses = c("factor", "factor", "factor", "factor", "integer", "factor", "integer", "factor", "factor", "factor", "integer", "factor", "factor"), col.names = c("col1", "col2", "ddate", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11", "col12", "col13"))
datagae[,"ddate"] = as.Date(datagae[,"ddate"],"%Y%m%d")

# file 3 (with extra column, which I don't need - not sure how to skip it, NULL doesn't help)
datagae <- read.csv("file3.csv", header = FALSE, colClasses = c("factor", "factor", "factor", "factor", "integer", "factor", "integer", "factor", "factor", "factor", "integer", "factor", "factor", NULL), col.names = c("col1", "col2", "ddate", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11", "col12", "col13", ""))
datagae[,"ddate"] = as.Date(datagae[,"ddate"],"%Y%m%d")

(All data frames will be merged after loading.)
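
On the extra column in file 3: inside `c()`, a bare `NULL` is simply dropped, so the `colClasses` vector above ends up with 13 entries instead of 14. `read.csv` skips a column when its class is given as the string `"NULL"`. A minimal sketch, assuming a temporary file that stands in for file3.csv and a placeholder name `extra` for the unwanted column (both are illustrative, not from the original post):

```r
# Hypothetical stand-in for file3.csv: two rows copied from the file 3 sample
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "VA8,VA3,20140929,,0,0,14571,20140603163257,S1,S233,390,20140603163421,ms,4503623383908352",
  "VA9,VA0,20140611,,0,0,13329,20140603171428,U6,U355,165,20140603171553,ms,4503639892688896"
), tmp)

# Same 13 classes as in the question, plus the string "NULL" for column 14
classes <- c("factor", "factor", "factor", "factor", "integer", "factor",
             "integer", "factor", "factor", "factor", "integer", "factor",
             "factor", "NULL")
names13 <- c("col1", "col2", "ddate", paste0("col", 4:13))

# col.names still needs 14 entries; "extra" is a placeholder for the skipped column
file3 <- read.csv(tmp, header = FALSE, colClasses = classes,
                  col.names = c(names13, "extra"))
file3[, "ddate"] <- as.Date(file3[, "ddate"], "%Y%m%d")
str(file3)
```

The resulting data frame has the same 13 columns as the other files, which should make the later merge simpler.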

UPD. File samples (the number of possible file formats is limited, and known!):

# file 1
or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source
VA1,VA2,2014-05-24,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,tp
VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,tp
VA1,VA2,2014-06-05,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,tp
VA1,VA2,2014-06-09,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,tp
VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp

# file 2    
VA2,VA4,20140722,,0,3,6164,20140521121156,U1,U141,140,20140521121156,ms
VA3,VA5,20140701,,0,0,15176,20140521145035,S1,S1342,355,20140521145035,ms
VA3,VA6,20140710,,0,0,6676,20140521105118,S1,S1602,105,20140521105118,ms
VA2,VA7,20140729,,0,0,10023,20140521132150,U6,U100,230,20140521132150,ms
VA2,VA5,20140527,,0,0,13209,20140521145005,S7,S115,355,20140521145005,ms

# file 3
VA8,VA3,20140929,,0,0,14571,20140603163257,S1,S233,390,20140603163421,ms,4503623383908352
VA9,VA0,20140611,,0,0,13329,20140603171428,U6,U355,165,20140603171553,ms,4503639892688896
VA2,VA4,20140722,,0,3,6164,20140521121156,U1,U141,140,20140521121156,ms,4503659220041728
VA3,BAX,20140601,,0,0,14176,20140525101531,S1,S1430,250,20140525101608,ms,4503686600458240
VA3,REN,20140602,,0,0,10174,20140531213527,S1,S1244,121,20140531213653,ms,4503703511891968

# file 4   
or,added,key,source,price,d,av_s,type,number,company,class,changes,minutes,fdate,ddate,code
VA2,20140808T122044,VA2:VA9:20140808::0:0:14430:20140808122044,qE,14430,VA9,2,319,6156,S1,0,0,90,20140808T122044,20140808T192500,B
VA2,20140808T122044,VA2:VA9:20140808::0:0:19180:20140808122044,qE,19180,VA9,2,319,6182,S1,0,0,90,20140808T122044,20140808T222000,Y
VA2,20140808T122044,VA2:VA9:20140808::0:1:14866:20140808122044,qE,14866,VA9,1,319,41,S7,1,0,100,20140808T122044,20140808T203500,D
VA2,20140808T122045,VA2:VA9:20140808::0:1:35180:20140808122045,qE,35180,VA9,2,319,6146,S1,1,0,90,20140808T122045,20140808T171000,C
VA2,20140808T122044,VA2:VA9:20140809::0:0:3180:20140808122043,qE,3180,VA9,2,319,6186,S1,0,0,95,20140808T122043,20140809T232000,N

# file 5
data,key
"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128
"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368
"VA1,VA3,20140710,,0,0,11678,20140604085203,S1,S1430,250,20140604085329,m/t",4503666467799040
"VA2,VA1,20140724,,0,0,7109,20140523133835,S7,S793,130,20140523133835,m/t",4503679218483200
"VA3,VA1,20140925,,0,0,10592,20140604092548,S7,S109,395,20140604092714,m/t",4503694653521920

1 answer:

Answer 0 (score: 0)

In the end, I came up with the following solution:

fileCSV <- "file.csv"
conn <- file(fileCSV, "rt")
firstLine <- readLines(conn, n = 1) # read only the first line
close(conn) # close the connection once the first line has been read
fileFormat <- NULL
# Pattern for a headerless data row. Note: [A-Z]{3} expects three-letter codes;
# the anonymized samples above (e.g. VA2) contain digits and would need [A-Z0-9]{3}.
rowPattern <- "[A-Z]{3},[A-Z]{3},20\\d{6},(20\\d{6})?,\\d,\\d,\\d+,20\\d{12}"
nFields <- length(strsplit(firstLine, ",")[[1]]) # number of comma-separated fields
if (firstLine == "or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source") {
    fileFormat <- 1

    data <- read.csv(fileCSV, header = TRUE, colClasses = c("factor", "factor", "factor", "factor", "integer", "factor", "integer", "factor", "factor", "factor", "integer", "factor", "factor"))
    data[, "ddate"] <- as.Date(data[, "ddate"], "%Y-%m-%d")
} else if (firstLine == "or,added,key,source,price,d,av_s,type,number,company,class,changes,minutes,fdate,ddate,code") {
    fileFormat <- 2

    # another approach to read the file
} else if (firstLine == "data,key") {
    fileFormat <- 3

    # third approach to read the file
} else if (grepl(rowPattern, firstLine) && nFields == 13) {
    fileFormat <- 4
} else if (grepl(rowPattern, firstLine) && nFields == 14) {
    fileFormat <- 5
}
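
The `data,key` branch above is left as a comment. One possible way to fill it in is to read the file twice: once for the two outer columns, then again on the quoted `data` column as if it were its own headerless CSV. The sketch below is only an illustration: the temporary file stands in for a real one, and the inner column names are assumed to follow the file 1 header order, which the original post does not confirm.

```r
# Hypothetical stand-in for a "data,key" file: two rows from the file 5 sample
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'data,key',
  '"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128',
  '"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368'
), tmp)

# Pass 1: read the two outer columns as text (keeps the long keys exact)
wrapped <- read.csv(tmp, header = TRUE, colClasses = "character")

# Assumption: the embedded fields use the same order as the file 1 header
fieldNames <- c("or", "d", "ddate", "rdate", "changes", "class", "price",
                "fdate", "company", "number", "minutes", "added", "source")

# Pass 2: each value of the data column is parsed as one line of a headerless CSV
inner <- read.csv(text = wrapped$data, header = FALSE,
                  colClasses = "character", col.names = fieldNames)
inner[, "ddate"] <- as.Date(inner[, "ddate"], "%Y%m%d")
inner$key <- wrapped$key
str(inner)
```

Everything except `ddate` is kept as character here; converting the remaining columns to factors or integers would follow the same `colClasses` choices as the other branches.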

The first line is analyzed to identify the file's format, and the file is then read accordingly.
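
Since all files are merged after loading, the same first-line check could be wrapped in a function and applied to each file in turn. The following is a rough sketch rather than the code actually used: `detect_format` and its return labels are hypothetical names, and the pattern uses `[A-Z0-9]{3}` on the assumption that codes may contain digits, as they do in the anonymized samples.

```r
# Hypothetical helper: classify a CSV layout from its first line only
detect_format <- function(first_line) {
  n_fields <- length(strsplit(first_line, ",", fixed = TRUE)[[1]])
  # Assumption: codes are three upper-case letters or digits (fits "VA2", "BAX")
  row_pattern <- "^[A-Z0-9]{3},[A-Z0-9]{3},20\\d{6},(20\\d{6})?,\\d,\\d,\\d+,20\\d{12},"
  if (startsWith(first_line, "or,d,ddate,")) {
    "header_13"      # file 1 layout
  } else if (startsWith(first_line, "or,added,key,")) {
    "header_16"      # file 4 layout
  } else if (first_line == "data,key") {
    "nested"         # file 5 layout
  } else if (grepl(row_pattern, first_line) && n_fields == 13) {
    "plain_13"       # file 2 layout
  } else if (grepl(row_pattern, first_line) && n_fields == 14) {
    "plain_14"       # file 3 layout
  } else {
    stop("unrecognised CSV layout: ", first_line)
  }
}

# First lines taken from the samples in the question
samples <- c(
  "or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source",
  "VA2,VA4,20140722,,0,3,6164,20140521121156,U1,U141,140,20140521121156,ms",
  "VA8,VA3,20140929,,0,0,14571,20140603163257,S1,S233,390,20140603163421,ms,4503623383908352",
  "data,key"
)
formats <- vapply(samples, detect_format, character(1), USE.NAMES = FALSE)
print(formats)
```

For real files, `readLines(path, n = 1)` would supply the first line, and a `switch()` on the returned label could select the matching `read.csv` call. Stopping on an unknown layout is a design choice here, so that an unexpected file does not get merged silently.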