I have a groupedRDD with keys of type String and values of type Iterable<String>. The values are JSON data stored as Strings, and the grouping key has the format <tenant_id>/<year>/<month>. I want to save this RDD to HDFS based on the key name, with exactly one output file per key.
Example: if my grouped RDD contains the following keys
tenant1/2016/12/output_data.json
tenant1/2017/01/output_data.json
tenant1/2017/02/output_data.json
then my HDFS should contain these three files
tenant1/2016/12/output_data.json
tenant1/2017/01/output_data.json
tenant1/2017/02/output_data.json
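For reference, here is a rough sketch of how this grouped RDD is shaped (the Record class and its field names are made up purely for illustration and are not my actual code):

// Illustration only: Record and its fields are hypothetical names.
case class Record(tenantId: String, year: String, month: String, json: String)

// records: RDD[Record] parsed from the raw input
val groupedRDD: org.apache.spark.rdd.RDD[(String, Iterable[String])] = records
  .map(r => (s"${r.tenantId}/${r.year}/${r.month}/output_data.json", r.json))
  .groupByKey()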
To achieve this, I tried the following:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Write each record to the file named by its key, and keep the key out of the file contents.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}

groupedRDD.partitionBy(new HashPartitioner(1))
  .saveAsHadoopFile("/user/pkhode/output/", classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
This gives the expected number of output files:
/user/pkhode/output/tenant1/2016/12/output_data.json
/user/pkhode/output/tenant1/2017/01/output_data.json
/user/pkhode/output/tenant1/2017/02/output_data.json
However, the data in these files should be one JSON string per line. Instead, the result looks something like this:
List({json_object_in_string1}, {json_object_in_string2}, .....)
The expected result is:
{json_object_in_string1}
{json_object_in_string2}
.....
Can someone point me to how I can achieve this?
Thanks to @Tim P, I have updated my code to the following:
groupedRDD.partitionBy(new HashPartitioner(1000))
  .mapValues(_.mkString("\n"))
  .saveAsHadoopFile(outputPath, classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
This solution works fine for smaller data sizes, but when I tried it with an input dataset of around 20 GB, it gave the following error in the mapValues stage:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:181)
at com.esotericsoftware.kryo.io.Output.require(Output.java:160)
at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:462)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:363)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:191)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:184)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:195)
at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Answer 0 (score: 0)
When Spark saves an RDD as a text file, it simply calls toString on each element of the RDD. Try mapping the values to a String first:
rdd.mapValues(_.mkString("\n"))
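In the asker's case, the fix would slot into the save pipeline roughly like this (a sketch reusing the RDDMultipleTextOutputFormat class defined in the question):

groupedRDD
  .mapValues(_.mkString("\n"))  // one JSON string per line instead of the Iterable's toString
  .partitionBy(new HashPartitioner(1))
  .saveAsHadoopFile("/user/pkhode/output/", classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])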
Answer 1 (score: 0)
Instead of working with the grouped RDD, I converted my RDD into a paired RDD, as follows:
import com.google.gson.GsonBuilder

val resultRDD = inputRDD.map(row => {
  // Serialize each row to a JSON string, keeping null fields.
  val gson = new GsonBuilder().serializeNulls.create
  val data = gson.toJson(row)
  // The target file path becomes the key.
  val fileURL = s"${row.getTenantId}/${row.getYear}/${row.getMonth}/output_data.json"
  (fileURL, data)
})
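Here row is whatever record type inputRDD holds; the getters above assume something along these lines (a hypothetical shape, shown only to make the snippet self-contained):

// Hypothetical record class matching the getters used above.
case class InputRow(tenantId: String, year: String, month: String, payload: String) {
  def getTenantId: String = tenantId
  def getYear: String = year
  def getMonth: String = month
}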
Then I called saveAsHadoopFile to write the results into individual files, as shown below:
// Same idea as in the question: the key becomes the output file path.
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}

resultRDD.partitionBy(new HashPartitioner(1000))
  .saveAsHadoopFile(outputPath, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat])