How can I use Spark's hadoopFile method with a custom input format whose value type is Text?

Asked: 2019-05-14 09:28:10

Tags: scala apache-spark hadoop

How can I use Spark's hadoopFile method with a custom input format whose value type is Text? For example, OmnitureDataFileInputFormat, used to process Omniture clickstream data?

1 answer:

Answer 0 (score: 0)

import java.nio.charset.StandardCharsets

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.rdd.RDD
import org.rassee.omniture.hadoop.mapred.OmnitureDataFileInputFormat

// hadoopFile reads via the old (mapred) Hadoop API; the key is the byte
// offset and the value is the Text record produced by the input format.
val rddLines: RDD[String] =
  sparkSession.sparkContext
    .hadoopFile(
      path = path,
      inputFormatClass = classOf[OmnitureDataFileInputFormat],
      keyClass = classOf[LongWritable],
      valueClass = classOf[Text]
    )
    // Copy the Text's bytes before the record reader reuses the Writable,
    // then decode each record as a UTF-8 string.
    .map { case (_, value) => new String(value.copyBytes(), StandardCharsets.UTF_8) }
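If the input format is implemented against the newer org.apache.hadoop.mapreduce API rather than org.apache.hadoop.mapred, the equivalent call is newAPIHadoopFile. A minimal sketch, assuming a hypothetical mapreduce-package variant of the input format (the class name below is an assumption; substitute your actual class):

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.rdd.RDD

// Hypothetical: a variant of the input format written against the new
// (mapreduce) Hadoop API; replace with your real input format class.
import org.rassee.omniture.hadoop.mapreduce.OmnitureDataFileInputFormat

val rddNewApi: RDD[String] =
  sparkSession.sparkContext
    .newAPIHadoopFile(
      path,
      classOf[OmnitureDataFileInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sparkSession.sparkContext.hadoopConfiguration
    )
    // Same pattern as above: copy the reused Text's bytes, decode as UTF-8.
    .map { case (_, value) => new String(value.copyBytes(), StandardCharsets.UTF_8) }
```

Copying the bytes matters with either API: Hadoop record readers reuse the same Writable instance across records, so caching or collecting the raw Text objects without a copy yields duplicated last-record values.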