Question

I am trying to read the following CSV file into R:

http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv

The code I am currently using is:

url <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
shorthistory <- read.csv(url, skip = 4)

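However, I keep getting the following warnings:

1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
3: In readLines(file, skip) : line 3 appears to contain an embedded nul
4: In readLines(file, skip) : line 4 appears to contain an embedded nul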

This leads me to believe I am using the function incorrectly, since it fails on every line.

Any help would be much appreciated!

Answer 1

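read.csv() does not seem to work here, apparently because of the blank cell at the top left of the file. The file has to be read line by line with readLines(), and the first 4 lines then skipped.

An example is shown below. The file is opened as a file connection (file()) and read line by line (readLines()). The first 4 lines are skipped by subsetting. The file is tab-delimited, so strsplit() is applied to each line in turn. The result is still a list of character vectors, which should be reshaped into a data frame or another suitable type.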

# open file connection and read lines
path <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
con <- file(path, open = "rt", raw = TRUE)
text <- readLines(con, skipNul = TRUE)
close(con)

# skip first 4 lines
text <- text[5:length(text)]
# recursively split string
text <- do.call(c, lapply(text, strsplit, split = "\t"))

text[[1]][1:4]
# [1] "1-PAGE LTD ORDINARY" "1PG "                "1330487"             "1.72"
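The last step mentioned above, turning the list of character vectors into a data frame, could look roughly like this. It is only a sketch: the second row and the column names Volume and Percent are made up for illustration, and the real file has many more columns.

```r
# Hypothetical rows standing in for the output of the strsplit() step;
# the first row is the one printed above, the second is invented
rows <- list(
  c("1-PAGE LTD ORDINARY", "1PG ", "1330487", "1.72"),
  c("EXAMPLE CO ORDINARY", "EXC ", "2500", "0.10")
)

# Pad every row to the same width so rbind() yields a rectangular matrix
width <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_character_, width - length(r))))

df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- c("Company", "Ticker", "Volume", "Percent")  # assumed names

# Strip stray whitespace and convert the numeric columns
df$Ticker  <- trimws(df$Ticker)
df$Volume  <- as.numeric(df$Volume)
df$Percent <- as.numeric(df$Percent)

str(df)
```

Padding with NA is a design choice for ragged rows; if every line is known to have the same number of fields, do.call(rbind, rows) alone should be enough.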

Answer 2

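In the end I did not try readLines(), because it turned out the file is Unicode (UTF-16). Yes, the file format is awful, but I ended up using the following code to extract just the short-sale volume data.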

  library(reshape2)  # provides melt()

  # the file is UTF-16 encoded and tab-separated
  shorthistory <- read.csv("http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv",skip=1,fileEncoding = "UTF-16",sep = "\t")
  # drop the first two rows
  shorthistory <- shorthistory[-(1:2),]
  # move the row names into a regular first column
  shorthistory <- cbind(Row.Names = rownames(shorthistory), shorthistory)
  rownames(shorthistory) <- NULL
  # keep characters 2 to 11 of each column name (the date part)
  colnames(shorthistory) <- substr(colnames(shorthistory),2,11)
  colnames(shorthistory)[1] <- "Company"
  colnames(shorthistory)[2] <- "Ticker"
  shorthist1 <- shorthistory[,1:2]
  i=3 ##start at first volume column with short data
  # keep only the even-numbered columns from here on
  while(i<=length(colnames(shorthistory))){
    if(i%%2 == 0){
      shorthist1 <- cbind(shorthist1,shorthistory[i])
      i <- i+1
      }
    else{
      i <- i+1
    }
  }
  # reshape to long format: one row per company and date
  melted <- melt(data = shorthist1,id = c("Ticker","Company"))
  melted$variable <- as.POSIXlt(x = melted$variable,format = "%Y.%m.%d")
  # treat empty cells as zero
  melted$value[melted$value==""] <- 0.00
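As a side note, the while loop above keeps the first two columns plus every even-numbered column from 4 onward. The same selection can probably be written with seq(); here is a small sketch on a made-up data frame (wide, c3 to c6 and narrow are hypothetical names, not columns from the real file).

```r
# Hypothetical wide table: two id columns followed by alternating columns
wide <- data.frame(
  Company = c("A LTD", "B LTD"),
  Ticker  = c("AAA", "BBB"),
  c3 = c(1, 2), c4 = c(10, 20),
  c5 = c(3, 4), c6 = c(30, 40),
  stringsAsFactors = FALSE
)

# Keep the id columns plus every even-numbered column from 4 onward
# (assumes the table has at least 4 columns, otherwise seq() fails)
keep <- c(1:2, seq(4, ncol(wide), by = 2))
narrow <- wide[, keep]

names(narrow)
# [1] "Company" "Ticker"  "c4"      "c6"
```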

Answer 3

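After running into a lot of problems with CSV files that contain a BOM (byte order mark) and NULs, I wrote this little function. It reads the file line by line (ignoring NULs), skips empty lines, and then applies read.csv.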

# Read CSV files with BOM and NUL problems
read.csvX = function(file, encoding="UTF-16LE", header=T, stringsAsFactors=T) {
  csvLines = readLines(file, encoding=encoding, skipNul=T, warn=F)
  # Remove BOM (ÿþ) from first line
  if (substr(csvLines[[1]], 1, 2) == "ÿþ") {
    csvLines[[1]] = substr(csvLines[[1]], 3, nchar(csvLines[[1]]))
  }
  csvLines = csvLines[csvLines != ""]
  if (length(csvLines) == 0) {
    warning("Empty file")
    return(NULL)
  }
  csvData = read.csv(text=paste(csvLines, collapse="\n"), header=header, stringsAsFactors=stringsAsFactors)
  return(csvData)
}
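To see where the NULs come from, one can build a tiny UTF-16LE file locally and look at its bytes. This is only an illustration with made-up content (the header names and the single data row are assumptions, and no BOM is written): in UTF-16, every ASCII character is stored as two bytes, the second of which is 00.

```r
# Hypothetical tab-separated sample, encoded as UTF-16LE without a BOM
txt <- "Company\tTicker\tVolume\n1-PAGE LTD ORDINARY\t1PG\t1330487\n"
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

tmp <- tempfile(fileext = ".csv")
writeBin(bytes, tmp)

# Each ASCII character is followed by a 00 byte; these are the
# "embedded nul" values that readLines() complains about
print(head(bytes, 8))

# Declaring the encoding lets R decode the file before parsing it
dat <- read.delim(tmp, fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)
str(dat)

unlink(tmp)
```

When the encoding is known up front, passing fileEncoding like this is an alternative to stripping the NULs after the fact.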

I hope this answer to an old question helps someone.

Reading a non-standard CSV file into R
