I'm using Spark 2.2 with Scala 2.11 to parse a directory and transform the data in it.
To handle the ISO charset, I'm reading the files with hadoopFile like this:
val inputDirPath = "myDirectory"
sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath).map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "iso-8859-1")).map(ProcessFunction(_)).toDF
How can I add the file name of each line so that it reaches ProcessFunction? ProcessFunction takes a String as its argument and returns an object.
Thanks for your time.
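For reference, ProcessFunction itself is not shown in the question; a purely hypothetical stub matching the described signature (a String in, an object out) could look like this:

// Hypothetical placeholder only; the real ProcessFunction is not part of the question.
case class Record(value: String)

object ProcessFunction {
  def apply(line: String): Record = Record(line.trim)
}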
Answer 0 (score: 2)
An answer that includes your ProcessFunction:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD
val inputDirPath = "dataset.txt"
val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
// cast to HadoopRDD so that each partition's input split (and thus its file path) is accessible
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    // decode each line with ISO-8859-1 and pair it with the path of the file it came from
    iterator.map(tuple => (file.getPath, new String(tuple._2.getBytes, 0, tuple._2.getLength, "iso-8859-1")))
  })
  .map { case (path, line) => (path, ProcessFunction(line)) }
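If you still want a DataFrame at the end, as in your original pipeline, one possible continuation is the sketch below. It assumes a SparkSession named spark and that ProcessFunction returns a case class (such as the hypothetical Record above) so Spark can derive an encoder; the Hadoop Path is converted to String because Path has no encoder.

import spark.implicits._

val df = linesWithFileNames
  .map { case (path, record) => (path.toString, record) }  // Path itself is not encodable
  .toDF("file", "record")

df.show(false)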
Answer 1 (score: 1)
// (uses the same imports as the previous answer)
val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)

// cast to HadoopRDD to get access to each partition's input split
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tuple => (file.getPath, tuple._2))
  })

linesWithFileNames.foreach(println)
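One caveat worth adding: Hadoop's record reader reuses the Text instance in tuple._2, so if these pairs are collected or cached rather than consumed immediately, it is safer to copy the contents out first. A minimal sketch, reusing the decoding from the question:

val safeLines = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    // copy the reused Text into a fresh String before the record leaves the partition
    iterator.map(t => (file.getPath.toString, new String(t._2.getBytes, 0, t._2.getLength, "iso-8859-1")))
  })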