My final RDD looks like this:
FinalRDD.collect()
Array[(Int, Seq[Iterable[Int]])] = Array((1,List(List(97), List(98), List(99), List(100))), (2,List(List(97, 98), List(97, 99), List(97, 101))), (3,List(List(97, 98, 99),List(99, 102, 103))))
I want to write this RDD to a text file in the following format:
('97'), ('98'), ('99'), ('100')
('97', '98'), ('97', '99'), ('97', '101')
('97', '98', '99'), ('99', '102', '103')
Many sites I found suggest the PrintWriter class from java.io as one option to achieve this. Here is the code I tried:
val writer = new PrintWriter(new File(outputFName))

def writefunc(chunk: Seq[Iterable[Int]]) {
  var n = chunk
  print("inside write func")
  for (i <- 0 until n.length) {
    writer.print("('" + n(i) + "')" + ", ")
  }
}
finalRDD.mapValues(list =>writefunc(list)).collect()
I end up with a Task not serializable error, as shown below:
finalRDD.mapValues(list =>writefunc(list)).collect()
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:758)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:757)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.mapValues(PairRDDFunctions.scala:757)
... 50 elided
Caused by: java.io.NotSerializableException: java.io.PrintWriter
Serialization stack:
- object not serializable (class: java.io.PrintWriter, value: java.io.PrintWriter@b0c0abe)
- field (class: $iw, name: writer, type: class java.io.PrintWriter)
- object (class $iw, $iw@31afbb30)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@672ca5ae)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@528ac6dd)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@b772a0e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@7b11bb43)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@94c2342)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@2bacf377)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@718e1924)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6d112a64)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5747e8e4)
- field (class: $line411.$read, name: $iw, type: class $iw)
- object (class $line411.$read, $line411.$read@59a0616c)
- field (class: $iw, name: $line411$read, type: class $line411.$read)
- object (class $iw, $iw@a375f8f)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@4e3978ff)
- field (class: $anonfun$1, name: $outer, type: class $iw)
- object (class $anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
... 59 more
I am still at the stage of learning Scala. Could someone suggest how to write a "Seq[Iterable[Int]]" object to a text file?
Answer 0 (score: 1)
Since you don't really need Spark to save the data by hand and the collected result is small, just do finalRDD.collect() and apply any standard solution for printing the output to a file (for example, the snippet below, taken from https://stackoverflow.com/a/4608061/44647):
// taken from https://stackoverflow.com/a/4608061/44647
def printToFile(fileName: String)(op: java.io.PrintWriter => Unit) {
  val p = new java.io.PrintWriter(fileName)
  try { op(p) } finally { p.close() }
}

val collectedData: Seq[(Int, Seq[Iterable[Int]])] = finalRDD.collect()

val output: Seq[String] = collectedData
  .map(_._2) // use only second part of tuple Seq[Iterable[Int]]
  .map { seq: Seq[Iterable[Int]] =>
    // render inner Iterable[Int] as String in ('1', '2', '3') format
    val inner: Seq[String] = seq.map("(" + _.map(i => s"'$i'").mkString(", ") + ")")
    inner.mkString(", ")
  }

printToFile(outputFileName) { p => output.foreach(p.println) }
If your RDD's schema changes, the type of the collected collection will change as well, and you will have to adjust this code accordingly.
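One way to keep that adjustment in a single place is to factor the formatting out; this is just a sketch of the idea with a hypothetical render helper, not part of the original answer:

// Hypothetical helper: all formatting lives here, so a schema change only
// touches this function and the declared type of collectedData.
def render(row: (Int, Seq[Iterable[Int]])): String =
  row._2.map("(" + _.map(i => s"'$i'").mkString(", ") + ")").mkString(", ")

printToFile(outputFileName) { p => collectedData.map(render).foreach(p.println) }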
Output for the test sample of collected data (since there is no context here to reconstruct the RDD):
('97'), ('98'), ('99'), ('100')
('97', '98'), ('97', '99'), ('97', '101')
('97', '98', '99'), ('99', '102', '103')
UPDATE: the other answer (https://stackoverflow.com/a/4608061/44647) is right that you can produce an RDD[String] of the text and save the files somewhere via Spark's rdd.saveAsTextFile(...). However, this approach has a few potential issues (also covered in https://stackoverflow.com/a/49074625/44647):
1) An RDD with multiple partitions will produce multiple files (you would have to do something like rdd.repartition(1) to ensure that at least a single file containing all the data is produced); a short sketch addressing this follows the test output below.
2) The file name gets mangled (the path parameter is treated as a directory name), and a bunch of temporary junk is produced along the way. In the example below, the RDD was split into 4 files, part-00000 ... part-00003, because the RDD has 4 partitions, which illustrates points 1) and 2):
scala> sc.parallelize(collectedData, 4).map(x => x._2.map("("+_.mkString(", ")+")").mkString(", ")).saveAsTextFile("/Users/igork/testdata/test6")
ls -al ~/testdata/test6
total 64
drwxr-xr-x 12 igork staff 408 Mar 2 11:40 .
drwxr-xr-x 10 igork staff 340 Mar 2 11:40 ..
-rw-r--r-- 1 igork staff 8 Mar 2 11:40 ._SUCCESS.crc
-rw-r--r-- 1 igork staff 8 Mar 2 11:40 .part-00000.crc
-rw-r--r-- 1 igork staff 12 Mar 2 11:40 .part-00001.crc
-rw-r--r-- 1 igork staff 12 Mar 2 11:40 .part-00002.crc
-rw-r--r-- 1 igork staff 12 Mar 2 11:40 .part-00003.crc
-rw-r--r-- 1 igork staff 0 Mar 2 11:40 _SUCCESS
-rw-r--r-- 1 igork staff 0 Mar 2 11:40 part-00000
-rw-r--r-- 1 igork staff 24 Mar 2 11:40 part-00001
-rw-r--r-- 1 igork staff 30 Mar 2 11:40 part-00002
-rw-r--r-- 1 igork staff 29 Mar 2 11:40 part-00003
3) When you run on a Spark cluster with multiple nodes (especially when the workers and the driver are on different hosts), given a local path it will produce files on the local file systems of the worker nodes (and the part-0000* files may be scattered across different workers). An example run on Google Dataproc with 4 worker hosts is shown below. To get around this, use a real distributed file system such as HDFS, or a blob store such as S3 or GCS, and fetch the generated files from there; otherwise you are stuck retrieving multiple files from the worker nodes (see also the sketch after the test output).
The main() code of the test job:
import java.net.InetAddress
import java.nio.file.{Files, Paths}
import java.util.UUID

val collectedData: Seq[(Int, Seq[Seq[Int]])] =
  Array((1, List(List(97), List(98), List(99), List(100))),
        (2, List(List(97, 98), List(97, 99), List(97, 101))),
        (3, List(List(97, 98, 99), List(99, 102, 103))))
val rdd = sc.parallelize(collectedData, 4)
val uniqueSuffix = UUID.randomUUID()

// expected to run on Spark executors
rdd.saveAsTextFile(s"file:///tmp/just-testing/$uniqueSuffix/test3")

// expected to run on the Spark driver and find NO files
println("Files on driver:")
val driverHostName = InetAddress.getLocalHost.getHostName
Files.walk(Paths.get(s"/tmp/just-testing/$uniqueSuffix/test3"))
  .toArray.map(driverHostName + " : " + _).foreach(println)

// just a *hack* to list files on every executor and get the output to the driver
// PLEASE DON'T DO THAT IN PRODUCTION CODE
val outputRDD = rdd.mapPartitions[String] { _ =>
  val hostName = InetAddress.getLocalHost.getHostName
  Seq(Files.walk(Paths.get(s"/tmp/just-testing/$uniqueSuffix/test3"))
    .toArray.map(hostName + " : " + _).mkString("\n")).toIterator
}

// expected to list files as seen on executor nodes - multiple files should be present
println("Files on executors:")
outputRDD.collect().foreach(println)
Note how the files are split between the different hosts, while the driver dp-igork-test-m has no useful files at all, because they live on the worker nodes dp-igork-test-w-*. The output of the test job (hostnames changed for anonymity):
18/03/02 20:54:00 INFO org.spark_project.jetty.util.log: Logging initialized @1950ms
18/03/02 20:54:00 INFO org.spark_project.jetty.server.Server: jetty-9.2.z-SNAPSHOT
18/03/02 20:54:00 INFO org.spark_project.jetty.server.ServerConnector: Started ServerConnector@772485dd{HTTP/1.1}{0.0.0.0:4172}
18/03/02 20:54:00 INFO org.spark_project.jetty.server.Server: Started @2094ms
18/03/02 20:54:00 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.3-hadoop2
18/03/02 20:54:01 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dp-igork-test-m/10.142.0.2:8032
18/03/02 20:54:03 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1520023415468_0003
18/03/02 20:54:07 WARN org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Files on driver:
dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3
dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/._SUCCESS.crc
dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_SUCCESS
Files on executors:
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000003_3
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000002_2
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00002
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00003
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00003.crc
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00002.crc
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00001.crc
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00000
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000001_1
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000000_0
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00001
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00000.crc
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000003_3
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000002_2
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00002
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00003
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00003.crc
dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00002.crc
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00001.crc
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00000
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000001_1
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000000_0
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00001
dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00000.crc
18/03/02 20:54:12 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@772485dd{HTTP/1.1}{0.0.0.0:4172}
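To tie up points 1) and 3) above, here is a minimal sketch, not part of the original answer, showing how a single output file could be written to a shared location instead of worker-local disks. The hdfs:/// path below is a hypothetical placeholder (an s3a:// or gs:// URI would work the same way, given the right cluster configuration):

// Render each value as "('a', 'b', ...), ('c', ...)" lines, as in the examples above.
val lines = finalRDD.map { case (_, seqs) =>
  seqs.map(inner => "(" + inner.map(i => s"'$i'").mkString(", ") + ")").mkString(", ")
}

// Point 1: coalesce to one partition so a single part-00000 file is written.
// Point 3: save to a distributed file system so the result does not end up
// scattered over worker-local disks.
lines.coalesce(1).saveAsTextFile("hdfs:///tmp/just-testing/final-output") // hypothetical path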
Answer 1 (score: 1)
You need neither to collect the rdd nor the PrintWriter APIs. A simple combination of the map and mkString functions should get you there; at the end, just use the saveAsTextFile API to save the rdd to a text file.
finalRDD.map(x => x._2.map("("+_.mkString(", ")+")").mkString(", ")).saveAsTextFile("path to output text file")
Your text file should then contain the following lines of text:
(97), (98), (99), (100)
(97, 98), (97, 99), (97, 101)
(97, 98, 99), (99, 102, 103)
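If you want the single-quoted format asked for in the question (e.g. ('97', '98')), the same one-liner can quote each element before joining. This is just a sketch reusing the quoting trick from the first answer, with the output path still a placeholder:

finalRDD
  .map(x => x._2.map(inner => "(" + inner.map(i => s"'$i'").mkString(", ") + ")").mkString(", "))
  .saveAsTextFile("path to output text file")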