Question

I am reading a fairly large (> 5 GB) csv file into R. The file is written in UTF-16.

Most of the functions that read large files efficiently (fread, read_delim) do not work with UTF-16.

read.csv lets you specify the encoding and will read UTF-16, but it is slow on files this large.
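For reference, here is a minimal sketch of what reading UTF-16 with read.csv looks like on a toy file. The temporary file, its two rows, and the choice of UTF-16LE (no byte order mark) are assumptions made for illustration only; the real file is far larger and may use a different UTF-16 variant.

```r
# Write a tiny tab-separated file encoded as UTF-16LE (illustrative data only).
tmp <- tempfile(fileext = ".csv")
out <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(c("gbifID\tgenus", "1\tQuercus", "2\tAcer"), out)
close(out)

# fileEncoding tells read.csv how to decode the file while reading it.
df <- read.csv(tmp, sep = "\t", fileEncoding = "UTF-16LE",
               stringsAsFactors = FALSE)
print(df)
```

When read.csv is given a file name, fileEncoding is what triggers the decoding. When it is given an already opened connection, the encoding declared on the connection is used instead.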

I have it working with read.csv(); see the code below. But I am curious whether anyone knows a more efficient way to read UTF-16 data into R.

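### iterative process using read.csv https://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces/30403877#30403877
library(dplyr)  # provides %>% and select()

# establish a connection to the file; csvPath is the path to the UTF-16 csv
con <- file(csvPath, "r", encoding = "UTF-16")
# assumption: the first line of the file holds the tab-separated column names
header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]

chunkSize <- 10000
chunks <- list()  # collect the thinned chunks and bind them once at the end
x <- 0
repeat {
  # the connection already decodes UTF-16, so fileEncoding is not needed here;
  # read.csv() errors when no lines are left, which is assumed to mean end of file
  df <- tryCatch(
    read.csv(con, header = FALSE, sep = "\t", nrows = chunkSize),
    error = function(e) NULL
  )
  if (is.null(df)) break
  x <- x + 1
  print(c(chunk = x, rows = nrow(df)))
  colnames(df) <- header
  chunks[[x]] <- df %>%
    dplyr::select("gbifID", "genus", "species", "infraspecificEpithet", "taxonRank",
                  "countryCode", "locality", "stateProvince", "decimalLatitude",
                  "decimalLongitude", "basisOfRecord", "institutionCode")
  if (nrow(df) < chunkSize) break  # a short chunk is the last one
}
close(con)
# one rbind at the end avoids re-copying the growing data frame on every pass
df2 <- do.call(rbind, chunks)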
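One direction that might be worth testing is to re-encode the file to UTF-8 once, in chunks, so that a faster reader can be pointed at the result. The following is only a sketch in base R: `reencode_to_utf8` is a hypothetical helper name, the chunk size is arbitrary, and it assumes the R session runs in a UTF-8 locale so that no characters are lost in transit. I have not measured whether it is actually faster on a file of this size.

```r
# Hypothetical helper: stream a text file from one encoding to UTF-8 in chunks.
# Assumes a UTF-8 locale; `from` must name the file's actual encoding.
reencode_to_utf8 <- function(src, dest, from = "UTF-16", chunk_lines = 10000L) {
  input <- file(src, open = "r", encoding = from)
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "w", encoding = "UTF-8")
  on.exit(close(output), add = TRUE)
  repeat {
    # readLines() returns character(0) once the connection is exhausted
    lines <- readLines(input, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    writeLines(lines, output)
  }
  invisible(dest)
}
```

A call such as `reencode_to_utf8(csvPath, "occurrences_utf8.csv")` (the output name is made up) would leave a UTF-8 copy on disk, at the cost of one extra pass over the data and roughly the same amount of disk space again.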

Function for reading a csv in UTF-16 format in R

0 answers: