Reading a non-standard CSV file into R

Time: 2015-05-15 04:33:52

Tags: r csv import-from-csv

I am trying to read the following CSV file into R:

http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv

The code I am currently using is:

url <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
shorthistory <- read.csv(url, skip = 4)

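But I keep getting the following warnings:

1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
3: In readLines(file, skip) : line 3 appears to contain an embedded nul
4: In readLines(file, skip) : line 4 appears to contain an embedded nul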

This leads me to believe that I am using the function incorrectly, since it fails on every line.

Any help would be much appreciated!

3 answers:

Answer 0: (score: 1)

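read.csv() does not seem to work here, apparently because of the blank area at the top left of the file. The file has to be read line by line with readLines(), and the first 4 lines then skipped.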

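An example is shown below. The file is opened as a file connection (file()) and read line by line (readLines()). The first 4 lines are skipped by subsetting. The file is tab-delimited, so strsplit() is applied to every line. The result is still a list of character vectors, which should be reformatted into a data frame or any other suitable type.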

# open file connection and read lines
path <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
con <- file(path, open = "rt", raw = TRUE)
text <- readLines(con, skipNul = TRUE)
close(con)

# skip first 4 lines
text <- text[5:length(text)]
# split every line on tabs; the result is a list with one character vector per line
text <- do.call(c, lapply(text, strsplit, split = "\t"))

text[[1]][1:4]
# [1] "1-PAGE LTD ORDINARY" "1PG "                "1330487"             "1.72"
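
The answer leaves the final step (turning the list of character vectors into a data frame) open. A minimal sketch of one way to do it follows. It assumes every line splits into the same number of fields, and the sample rows and column names are made up for illustration; they are not taken from the ASIC file.

```r
# Hypothetical sample standing in for the output of strsplit(); not real ASIC data
fields <- list(
  c("AAA LTD ORDINARY", "AAA", "1000", "1.50"),
  c("BBB LTD ORDINARY", "BBB", "2500", "0.75")
)

# Stack the equal-length character vectors row-wise into a character matrix
mat <- do.call(rbind, fields)

# Convert the matrix to a data frame, keeping strings as character
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("Company", "Ticker", "Volume", "Percent")  # assumed column names

# Columns arrive as character, so convert the numeric ones explicitly
df$Volume <- as.numeric(df$Volume)
df$Percent <- as.numeric(df$Percent)

str(df)
```

If the lines split into different numbers of fields, rbind() would recycle values, so the rows would need padding to a common length first.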

Answer 1: (score: 0)

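In the end I did not use readLines; it turned out that the file is encoded as Unicode (UTF-16). Yes, the file format is awful, but I ended up with the following code, which extracts only the volume data for short positions.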

  library(reshape2)  # assumed source of melt(), which is used below
  shorthistory <- read.csv("http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv",skip=1,fileEncoding = "UTF-16",sep = "\t")
  shorthistory <- shorthistory[-(1:2),]
  shorthistory <- cbind(Row.Names = rownames(shorthistory), shorthistory)
  rownames(shorthistory) <- NULL
  colnames(shorthistory) <- substr(colnames(shorthistory),2,11)
  colnames(shorthistory)[1] <- "Company"
  colnames(shorthistory)[2] <- "Ticker"
  shorthist1 <- shorthistory[,1:2]
  i=3 ##start at first volume column with short data
  while(i<=length(colnames(shorthistory))){
    if(i%%2 == 0){
      shorthist1 <- cbind(shorthist1,shorthistory[i])
      i <- i+1
      }
    else{
      i <- i+1
    }
  }
  melted <- melt(data = shorthist1,id = c("Ticker","Company"))
  melted$variable <- as.POSIXlt(x = melted$variable,format = "%Y.%m.%d")
  melted$value[melted$value==""] <- 0.00
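
The link between the "embedded nul" messages and the UTF-16 encoding can be reproduced locally. The sketch below is only an illustration: it writes a tiny tab-separated file as UTF-16LE (the sample content is made up), shows that every second byte is 0x00, and then reads it back by declaring the encoding through fileEncoding.

```r
# Hypothetical sample content; not taken from the ASIC file
txt <- "Company\tTicker\tVolume\nAAA LTD\tAAA\t1000\n"

# Encode the text as UTF-16LE bytes and write them to a temporary file
tmp <- tempfile(fileext = ".csv")
utf16_bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(utf16_bytes, tmp)

# Each ASCII character takes two bytes in UTF-16LE, the second one being 0x00.
# These zero bytes are what readLines() reports as embedded nuls.
raw_bytes <- readBin(tmp, what = "raw", n = 8)
print(raw_bytes)

# Declaring the encoding lets read.csv() decode the file properly
df <- read.csv(tmp, sep = "\t", fileEncoding = "UTF-16LE",
               stringsAsFactors = FALSE)
print(df)
```

The sample file here has no byte order mark, which is why "UTF-16LE" is given explicitly; the answer above passes "UTF-16" for the real file.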

Answer 2: (score: 0)

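After running into many problems with CSV files that contain a BOM (byte order mark) and NUL characters, I wrote this small function. It reads the file line by line (ignoring NULs), skips empty lines, and then applies read.csv.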

# Read CSV files with BOM and NUL problems
read.csvX = function(file, encoding="UTF-16LE", header=T, stringsAsFactors=T) {
  csvLines = readLines(file, encoding=encoding, skipNul=T, warn=F)
  # Remove BOM (ÿþ) from first line
  if (substr(csvLines[[1]], 1, 2) == "ÿþ") {
    csvLines[[1]] = substr(csvLines[[1]], 3, nchar(csvLines[[1]]))
  }
  csvLines = csvLines[csvLines != ""]
  if (length(csvLines) == 0) {
    warning("Empty file")
    return(NULL)
  }
  csvData = read.csv(text=paste(csvLines, collapse="\n"), header=header, stringsAsFactors=stringsAsFactors)
  return(csvData)
}
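
As a complementary sketch, the BOM can also be inspected at the byte level before choosing how to read the file. guess_bom_encoding() below is a hypothetical helper, not part of base R or of the function above, and it is deliberately simplified: it checks only the UTF-8 and UTF-16 marks, so a UTF-32LE file (whose mark also starts with FF FE) would be misreported.

```r
# Hypothetical helper: guess an encoding name from the first bytes of a file
guess_bom_encoding <- function(path) {
  b <- readBin(path, what = "raw", n = 3)
  if (length(b) >= 3 && all(b[1:3] == as.raw(c(0xEF, 0xBB, 0xBF)))) {
    return("UTF-8-BOM")
  }
  if (length(b) >= 2 && all(b[1:2] == as.raw(c(0xFF, 0xFE)))) {
    return("UTF-16")  # little-endian mark
  }
  if (length(b) >= 2 && all(b[1:2] == as.raw(c(0xFE, 0xFF)))) {
    return("UTF-16")  # big-endian mark
  }
  NA_character_  # no recognised mark
}

# Build a small UTF-16LE file with a leading BOM (made-up sample content)
tmp <- tempfile(fileext = ".csv")
body <- iconv("Company\tTicker\nAAA LTD\tAAA\n",
              from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), tmp)

enc <- guess_bom_encoding(tmp)
print(enc)

# Pass the guessed name to read.csv() through fileEncoding
df <- read.csv(tmp, sep = "\t", fileEncoding = enc, stringsAsFactors = FALSE)
print(df)
```

Checking raw bytes avoids comparing against the characters "ÿþ", which depends on how the current locale displays the bytes FF FE.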

I hope this answer to an old question helps someone.