Saving an RDD pair in a specific format in the output file

Time: 2017-07-30 10:24:00

Tags: apache-spark apache-spark-2.0

I have a JavaPairRDD, call it data, of type

<Integer,List<Integer>>

When I call data.saveAsTextFile("output"), the output contains data in the following format:

(1,[1,2,3,4])

etc.

I want the output file to contain something like this instead:

1 1,2,3,4

i.e. 1\t1,2,3,4

Any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 3)

You need to understand what's happening here. You have an RDD[(T, U)], where T and U are some object types; read it as an RDD of tuples of T and U. When you call saveAsTextFile() on this RDD, it converts each element of the RDD to a string and writes one string per line, which is how the text file is generated.

Now, how is an object of some type T converted to a string? By calling toString() on it. This is why you get [] representing the List and () representing the Tuple as a whole.
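
To see where the brackets and parentheses come from, here is a small plain-Java sketch with no Spark dependency. The Pair class and the asSavedLine helper are hypothetical stand-ins: Pair only mimics the "(a,b)" shape of scala.Tuple2's toString(), and asSavedLine mirrors the per-element toString() call described above. Note that java.util.List.toString() puts a space after each comma.

```java
import java.util.Arrays;
import java.util.List;

public class ToStringDemo {
    // Hypothetical stand-in for scala.Tuple2; it only mimics the "(a,b)" toString().
    static final class Pair<A, B> {
        final A first;
        final B second;

        Pair(A first, B second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return "(" + first + "," + second + ")";
        }
    }

    // Mirrors the per-element conversion: the saved line is element.toString().
    static String asSavedLine(Object element) {
        return element.toString();
    }

    public static void main(String[] args) {
        Pair<Integer, List<Integer>> element = new Pair<>(1, Arrays.asList(1, 2, 3, 4));
        System.out.println(asSavedLine(element)); // prints (1,[1, 2, 3, 4])
    }
}
```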

The solution is to map each element in your RDD to a string in the format you want before saving. I'm not that familiar with the Java syntax, but in Scala I'd do something like:

rdd.map(e=>s"${e._1}\t${e._2.mkString(",")}")

Here mkString concatenates the elements of a collection into a single string, separated by the given delimiter.
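
Since the question uses the Java API, here is a rough Java sketch of the same idea. The class name PairFormatter and the method formatLine are made up for illustration; the formatting itself is plain Java, with Collectors.joining playing the role of mkString. The Spark call is left as a comment because it assumes Spark is on the classpath and that data is the JavaPairRDD from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PairFormatter {
    // Builds "key<TAB>v1,v2,..." from one key and its list of values.
    static String formatLine(Integer key, List<Integer> values) {
        String joined = values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return key + "\t" + joined;
    }

    public static void main(String[] args) {
        // Prints the key, a tab, then the comma-separated values.
        System.out.println(formatLine(1, Arrays.asList(1, 2, 3, 4)));

        // Assumed Spark usage (not run here), with data as in the question:
        // data.map(t -> PairFormatter.formatLine(t._1(), t._2()))
        //     .saveAsTextFile("output");
    }
}
```

Mapping to a String first means saveAsTextFile() only ever sees strings, so the tuple and list toString() output never reaches the file.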

Let me know if this helped. Cheers.