Changing the SparkContext textFile method

Asked: 2016-10-27 06:46:27

Tags: apache-spark

I am trying to modify the textFile method in the Spark source code so that it returns an RDD of multi-line strings instead of an RDD of single-line strings. To do that, I want to find the place in the Spark source where the file contents are actually read from disk.

SparkConf sparkConf = new SparkConf().setAppName("MyJavaApp");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile(args[0], 1);
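
To make the goal concrete, what I am effectively after is records like in the sketch below (it relies on the new-API TextInputFormat honouring textinputformat.record.delimiter and is only an illustration of the desired output, not what I actually run), but produced directly by textFile:

// Illustration only: records delimited by blank lines instead of single lines.
// Uses the new Hadoop API (org.apache.hadoop.mapreduce.lib.input.TextInputFormat).
Configuration hadoopConf = new Configuration();
hadoopConf.set("textinputformat.record.delimiter", "\n\n");
JavaPairRDD<LongWritable, Text> records = ctx.newAPIHadoopFile(
    args[0], TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
// Even here, getting a JavaRDD<String> back would still require mapping over the pairs.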

But when I follow the textFile call chain, I only get as far as the HadoopRDD class. The call chain looks like this:

In JavaSparkContext.scala:

def textFile(path: String, minPartitions: Int): JavaRDD[String] =
  sc.textFile(path, minPartitions)

and in SparkContext.scala:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

def hadoopFile[K, V](path: String, ...): RDD[(K, V)] = {
  val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(this, ...).setName(path)
}

and in HadoopRDD.scala:

class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableWritable[Configuration]],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int) extends RDD[(K, V)](sc, Nil) with Logging { ...
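
So instead of changing textFile itself, I suppose I could pass my own InputFormat down this chain, along the lines of the hypothetical sketch below (MultiLineInputFormat is only a placeholder name I made up; the real record grouping would have to live in a custom RecordReader, which I have left out):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import java.io.IOException;

// Hypothetical sketch, old mapred API to match what hadoopFile/HadoopRDD expect.
public class MultiLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // A real implementation would wrap LineRecordReader and keep appending lines
        // until a record boundary; returning LineRecordReader as-is just reproduces
        // the one-line-per-record behaviour of textFile.
        return new LineRecordReader(job, (FileSplit) split);
    }
}

and then, from my application, instead of textFile:

JavaPairRDD<LongWritable, Text> records = ctx.hadoopFile(
    args[0], MultiLineInputFormat.class, LongWritable.class, Text.class, 1);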

I don't want to use a map function (as that is extra overhead) to build my custom RDD from the RDD of lines.
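
By "using a map function" I mean a post-hoc workaround roughly like the sketch below (mapPartitions with the Spark 2.x FlatMapFunction signature; grouping at blank lines is just an example). It makes an extra pass over data that has already been split into lines, and it silently breaks records that span partition boundaries, which is why I would rather change the behaviour at the point where the file is read:

// What I want to avoid: rebuilding multi-line records from the line RDD after the fact.
JavaRDD<String> paragraphs = lines.mapPartitions(it -> {
    java.util.List<String> out = new java.util.ArrayList<>();
    StringBuilder current = new StringBuilder();
    while (it.hasNext()) {
        String line = it.next();
        if (line.isEmpty()) {                       // blank line = record boundary (example)
            if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        } else {
            current.append(line).append('\n');
        }
    }
    if (current.length() > 0) {
        out.add(current.toString());
    }
    return out.iterator();                          // Spark 2.x: the lambda returns an Iterator
});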

Any help?

0 Answers:

There are no answers yet.