Spark: reading wholeTextFiles with non-UTF-8 encoding

Asked: 2017-04-04 07:34:19

Tags: apache-spark character-encoding apache-spark-sql

I want to read whole text files in a non-UTF-8 encoding into Spark via

val df = spark.sparkContext.wholeTextFiles(path, 12).toDF

How can I change the encoding? I want to read ISO-8859-encoded text, but it is not CSV; it is something more like XML (SGML).

Edit

Should I perhaps use a custom Hadoop file input format?

2 answers:

Answer 0 (score: 6)

You can use SparkContext.binaryFiles() to read the files and then build a String from each file's content, specifying the charset you need. binaryFiles returns (path, PortableDataStream) pairs, so decoding the values gives you (path, content) pairs. E.g.:

import java.nio.charset.StandardCharsets

val df = spark.sparkContext.binaryFiles(path, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
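
For reference, a minimal follow-up sketch; the column names below simply replace the default _1/_2 that toDF assigns to tuples:

// The pairs are (file path, decoded content)
val named = df.toDF("path", "content")
named.select("path").show(truncate = false)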

Answer 1 (score: 2)

It's quite simple.

Here is the source code:

import java.nio.charset.Charset

import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TextFile {
  val DEFAULT_CHARSET = Charset.forName("UTF-8")

  def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
    if (Charset.forName(charset) == DEFAULT_CHARSET) {
      context.textFile(location)
    } else {
      // can't pass a Charset object here because it's not serializable
      // TODO: maybe use mapPartitions instead?
      context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
        pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
      )
    }
  }
}
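
A minimal usage sketch (the path below is a placeholder):

val rdd = TextFile.withCharset(sc, "/path/to/data.sgml", "ISO-8859-1")
rdd.take(5).foreach(println)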

It is copied from here:

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala

And it is used here:

https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala

Edit

If you need whole text files, here is the actual source of the implementation:

def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
    assertNotStopped()
    val job = NewHadoopJob.getInstance(hadoopConfiguration)
    // Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
    // comma separated files as input. (see SPARK-7155)
    NewFileInputFormat.setInputPaths(job, path)
    val updateConf = job.getConfiguration
    new WholeTextFileRDD(
      this,
      classOf[WholeTextFileInputFormat],
      classOf[Text],
      classOf[Text],
      updateConf,
      minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
  }

Try changing:

.map(record => (record._1.toString, record._2.toString))

to (something like):

.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))
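
Since patching Spark itself is usually impractical (WholeTextFileInputFormat and WholeTextFileRDD are, as far as I know, private to Spark's own packages), here is a user-land sketch of the same idea, built on the binaryFiles approach from the first answer. The object name WholeTextFilesWithCharset is made up for illustration:

import java.nio.charset.Charset

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WholeTextFilesWithCharset {
  // Hypothetical helper: behaves like wholeTextFiles but decodes each file
  // with the given charset instead of assuming UTF-8.
  def apply(sc: SparkContext,
            path: String,
            charset: String,
            minPartitions: Int = 12): RDD[(String, String)] =
    sc.binaryFiles(path, minPartitions)  // (path, PortableDataStream) pairs
      .mapValues(s => new String(s.toArray, Charset.forName(charset)))
}

Calling WholeTextFilesWithCharset(sc, path, "ISO-8859-1") then yields the same (filename, content) pairs as wholeTextFiles, but decoded with the requested charset.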