Spark read/write (CSV) with ISO-8859-1

Date: 2016-08-24 18:24:07

Tags: scala apache-spark utf-16

I need to read an ISO-8859-1 encoded file, perform some operations, and then save it (again with ISO-8859-1 encoding). To test this, I borrowed from a test case I found in the Databricks CSV package: https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala

- Specifically: test("DSL test for iso-8859-1 encoded file")

val fileDF = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("charset", "iso-8859-1")
  .option("delimiter", "~")         // bogus - hopefully something not in the file, just want 1 record per line
  .load("s3://.../cars_iso-8859-1.csv")

fileDF.collect                            // I see the non-ASCII characters correctly

val selectedData = fileDF.select("_c0")   // just to show an operation
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "~")
  .option("charset", "iso-8859-1")
  .save("s3://.../carOutput8859")

This code runs without error, but it does not seem to honor the iso-8859-1 option on the output. At a Linux prompt (after copying from S3 to local Linux):

file -i cars_iso-8859-1.csv 
cars_iso-8859-1.csv: text/plain; charset=iso-8859-1

file -i carOutput8859.csv 
carOutput8859.csv: text/plain; charset=utf-8
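
For reference, here is a minimal sketch of what I would expect to work on newer Spark versions, assuming the built-in CSV source accepts an "encoding" option on the write side (I have not verified this on my version; the path is the same placeholder as above):

selectedData.write
  .option("header", "false")
  .option("delimiter", "~")
  .option("encoding", "ISO-8859-1")   // assumption: honored by the built-in CSV writer on newer releases
  .csv("s3://.../carOutput8859")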

I'm just looking for some good examples of reading and writing non-UTF-8 files. At this point I have a lot of flexibility in approach (it doesn't have to be a CSV reader). Any recommendations/examples?
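
One fallback I'm sketching out (not verified end-to-end; the output directory is a placeholder like the paths above, and I'm not sure a bare Hadoop Configuration on the executors will carry the S3 credentials): build the delimited lines myself and write each partition through an OutputStreamWriter with an explicit ISO-8859-1 charset, bypassing Spark's UTF-8 text writer entirely.

import java.io.OutputStreamWriter
import java.nio.charset.Charset
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.TaskContext

val charset = Charset.forName("ISO-8859-1")
val outDir  = "s3://.../carOutput8859-manual"   // placeholder output directory

fileDF.rdd
  .map(row => row.toSeq.map(v => if (v == null) "" else v.toString).mkString("~"))
  .foreachPartition { lines: Iterator[String] =>
    // one output file per partition, written with an explicit charset
    val path = new Path(s"$outDir/part-${TaskContext.getPartitionId()}")
    val fs   = path.getFileSystem(new Configuration())   // may need extra S3 config here
    val writer = new OutputStreamWriter(fs.create(path, true), charset)
    try lines.foreach(line => writer.write(line + "\n"))
    finally writer.close()
  }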

0 Answers:

No answers yet.