I am trying to change the textFile method in the Spark source code so that it returns an RDD of multi-line strings instead of an RDD of single-line strings. I want to find where in the Spark source the file content is actually read from disk. My application code:
SparkConf sparkConf = new SparkConf().setAppName("MyJavaApp");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile(args[0], 1);
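(As an aside: if one multi-line string per whole file would be enough, the existing wholeTextFiles API could already give me that without changing Spark at all. A minimal Scala sketch, reusing args(0) from my Java code as the input path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("MyScalaApp"))
// Each pair is (filePath, fileContent); the value is the entire file
// as one multi-line string.
val files: RDD[(String, String)] = sc.wholeTextFiles(args(0), minPartitions = 1)

But I want multi-line records in general, not necessarily whole files.)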
But when I follow the textFile call chain, I only end up in the HadoopRDD and RDD classes. The call chain is as follows:
In JavaSparkContext.scala:

def textFile(path: String, minPartitions: Int): JavaRDD[String] =
  sc.textFile(path, minPartitions)
and in SparkContext.scala:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
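As far as I can tell, the map here only converts Hadoop's Text into String; the splitting into lines happens earlier. So one thing I could do is call hadoopFile directly and keep the raw pairs, skipping that conversion. A sketch, assuming downstream code can work with org.apache.hadoop.io.Text:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// The same call textFile makes internally, minus the Text -> String map.
val pairs: RDD[(LongWritable, Text)] =
  sc.hadoopFile(args(0), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
// Caution: Hadoop RecordReaders reuse Writable objects, so values must be
// copied before being cached or collected.
val lines: RDD[Text] = pairs.values

But that still gives me one record per line, not multi-line records.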
and:

def hadoopFile[K, V](path: String, ...): RDD[(K, V)] = {
  val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(this, ...).setName(path)
}
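So hadoopFile only wires an InputFormat into a HadoopRDD; what counts as one record is decided by the InputFormat, not by Spark. One workaround I found would keep Spark unchanged and change the record delimiter instead, assuming a Hadoop version whose new-API TextInputFormat honors the textinputformat.record.delimiter setting (here a blank line separates records, so each record spans multiple lines):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat  // new mapreduce API

val conf = new Configuration(sc.hadoopConfiguration)
// Treat a blank line as the record separator instead of '\n'.
conf.set("textinputformat.record.delimiter", "\n\n")
val records = sc.newAPIHadoopFile(args(0), classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)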
and in HadoopRDD.scala:
class HadoopRDD[K, V](
sc: SparkContext,
broadcastedConf: Broadcast[SerializableWritable[Configuration]],
initLocalJobConfFuncOpt: Option[JobConf => Unit],
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int) extends RDD[(K, V)](sc, Nil) with Logging {...
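If I understand correctly, the bytes are not read inside HadoopRDD itself: its compute method asks the InputFormat for a RecordReader and iterates it, and for TextInputFormat that reader is Hadoop's LineRecordReader, which is where data is actually read from disk and split on '\n'. A simplified sketch of that loop (my reading of it, not Spark's exact code):

import org.apache.hadoop.mapred.{InputFormat, InputSplit, JobConf, Reporter}

// Roughly what HadoopRDD.compute does for one partition.
def readPartition[K, V](inputFormat: InputFormat[K, V],
                        split: InputSplit, jobConf: JobConf): Unit = {
  val reader = inputFormat.getRecordReader(split, jobConf, Reporter.NULL)
  val key = reader.createKey()
  val value = reader.createValue()
  // Each next() call is served by the RecordReader (LineRecordReader for
  // TextInputFormat), which reads the bytes and decides record boundaries.
  while (reader.next(key, value)) {
    // hand (key, value) to the iterator Spark exposes for this partition
  }
  reader.close()
}

So the line splitting I want to change lives in Hadoop's InputFormat, not in Spark's source.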
I don't want to use a map transformation (that's extra overhead) to build my custom RDD from the RDD of lines. Any help?