Scala: Writing an RDD to a .txt file

Date: 2018-03-02 11:14:19

Tags: scala rdd printwriter

My final RDD looks like this:

FinalRDD.collect()

Array[(Int, Seq[Iterable[Int]])] = Array((1,List(List(97), List(98), List(99), List(100))), (2,List(List(97, 98), List(97, 99), List(97, 101))), (3,List(List(97, 98, 99),List(99, 102, 103))))

I would like to write this RDD to a text file in the following format:

('97'), ('98'), ('100')

('97', '98'), ('97', '99'), List(97, 101)

('97','98', '99'), ('97', '99', '101')

I found many sites suggesting the PrintWriter class from java.io as one option to achieve this. Here is the code I tried:

val writer = new PrintWriter(new File(outputFName))

def writefunc(chunk : Seq[Iterable[Int]])
{
  var n=chunk
  print("inside write func")
  for(i <- 0 until n.length)
  {
    writer.print("('"+n(i)+"')"+", ")

  }
 }

finalRDD.mapValues(list =>writefunc(list)).collect()

I ended up with a "Task not serializable" error, as shown below:

finalRDD.mapValues(list =>writefunc(list)).collect()
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:758)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:757)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.mapValues(PairRDDFunctions.scala:757)
... 50 elided
Caused by: java.io.NotSerializableException: java.io.PrintWriter
Serialization stack:
- object not serializable (class: java.io.PrintWriter, value:   java.io.PrintWriter@b0c0abe)
- field (class: $iw, name: writer, type: class java.io.PrintWriter)
- object (class $iw, $iw@31afbb30)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@672ca5ae)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@528ac6dd)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@b772a0e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@7b11bb43)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@94c2342)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@2bacf377)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@718e1924)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6d112a64)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5747e8e4)
- field (class: $line411.$read, name: $iw, type: class $iw)
- object (class $line411.$read, $line411.$read@59a0616c)
- field (class: $iw, name: $line411$read, type: class $line411.$read)
- object (class $iw, $iw@a375f8f)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@4e3978ff)
- field (class: $anonfun$1, name: $outer, type: class $iw)
- object (class $anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
... 59 more

I am still at the stage of learning Scala. Could someone suggest how to write a Seq[Iterable[Int]] object to a text file?

2 Answers:

Answer 0 (score: 1)

Since you don't really need Spark to handle saving the data, and the collected result is small, just do finalRDD.collect() and apply any solution for printing the output to a file:

// taken from https://stackoverflow.com/a/4608061/44647
def printToFile(fileName: String)(op: java.io.PrintWriter => Unit) {
  val p = new java.io.PrintWriter(fileName)
  try { op(p) } finally { p.close() }
}

val collectedData: Seq[(Int, Seq[Iterable[Int]])] = finalRDD.collect()
val output: Seq[String] = collectedData
  .map(_._2) // use only second part of tuple Seq[Iterable[Int]]
  .map { seq: Seq[Iterable[Int]] =>
     // render inner Iterable[Int] as String in ('1', '2', '3') format
     val inner: Seq[String] = seq.map("(" + _.map(i => s"'$i'").mkString(", ") + ")")
     inner.mkString(", ")
  }

printToFile(outputFileName) { p => output.foreach(p.println) }

If your RDD's schema changes, the type of the collected collection will change as well, and you will have to adjust this code accordingly.

Output from testing with the sample collected data (since there was no context for reconstructing the RDD):

('97'), ('98'), ('99'), ('100')
('97', '98'), ('97', '99'), ('97', '101')
('97', '98', '99'), ('99', '102', '103')

UPDATE: The other answer https://stackoverflow.com/a/4608061/44647 is right that you can generate an RDD[String] of text and save the files somewhere via Spark's rdd.saveAsTextFile(...). However, this approach has a few potential issues (also covered in https://stackoverflow.com/a/49074625/44647):

1) An RDD with multiple partitions will produce multiple files (you would have to do something like rdd.repartition(1) to at least ensure that a single file containing the data is produced).
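
A minimal sketch of that workaround (the output path below is just a placeholder, and finalRDD is the RDD from the question):

// Collapse to a single partition so that only one part file is written.
// coalesce(1) avoids a full shuffle; repartition(1) would also work here.
finalRDD
  .map(x => x._2.map("(" + _.mkString(", ") + ")").mkString(", "))
  .coalesce(1)
  .saveAsTextFile("/tmp/just-testing/single-file-output") // placeholder path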

2) The file name gets mangled (the path parameter is treated as a directory name), and a bunch of temporary junk is produced alongside it. In the example below, the RDD gets split into 4 files, part-00000 ... part-00003, because the RDD has 4 partitions, which illustrates both 1) and 2):

scala> sc.parallelize(collectedData, 4).map(x => x._2.map("("+_.mkString(", ")+")").mkString(", ")).saveAsTextFile("/Users/igork/testdata/test6")

 ls -al  ~/testdata/test6
total 64
drwxr-xr-x  12 igork  staff  408 Mar  2 11:40 .
drwxr-xr-x  10 igork  staff  340 Mar  2 11:40 ..
-rw-r--r--   1 igork  staff    8 Mar  2 11:40 ._SUCCESS.crc
-rw-r--r--   1 igork  staff    8 Mar  2 11:40 .part-00000.crc
-rw-r--r--   1 igork  staff   12 Mar  2 11:40 .part-00001.crc
-rw-r--r--   1 igork  staff   12 Mar  2 11:40 .part-00002.crc
-rw-r--r--   1 igork  staff   12 Mar  2 11:40 .part-00003.crc
-rw-r--r--   1 igork  staff    0 Mar  2 11:40 _SUCCESS
-rw-r--r--   1 igork  staff    0 Mar  2 11:40 part-00000
-rw-r--r--   1 igork  staff   24 Mar  2 11:40 part-00001
-rw-r--r--   1 igork  staff   30 Mar  2 11:40 part-00002
-rw-r--r--   1 igork  staff   29 Mar  2 11:40 part-00003

3) When you run on a Spark cluster with multiple nodes (especially when the workers and the driver are on different hosts), a local path will produce files on the local file systems of the worker nodes (and the part-0000* files can end up scattered across different worker nodes). An example run on Google Dataproc with 4 worker hosts is provided below. To get around this, you need to use a real distributed file system such as HDFS, or blob storage such as S3 or GCS, and fetch the generated files from there. Otherwise, you are left retrieving multiple files from the worker nodes.
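
Saving to shared storage is then just a matter of passing a different URI scheme to saveAsTextFile; a hedged sketch (the GCS bucket and HDFS paths below are placeholders, and the relevant Hadoop connector is assumed to be on the classpath, as it is on Dataproc for GCS):

// Same rendering as before, but written to shared storage instead of executor-local disk.
val rendered = finalRDD.map(x => x._2.map("(" + _.mkString(", ") + ")").mkString(", "))

rendered.saveAsTextFile("gs://my-bucket/rdd-output")      // Google Cloud Storage (placeholder bucket)
// rendered.saveAsTextFile("hdfs:///user/me/rdd-output")  // or HDFS (placeholder path)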

The code of the test job's main():

import java.net.InetAddress
import java.nio.file.{Files, Paths}
import java.util.UUID

val collectedData: Seq[(Int, Seq[Seq[Int]])] =
  Array((1, List(List(97), List(98), List(99), List(100))),
    (2,List(List(97, 98), List(97, 99), List(97, 101))),
    (3,List(List(97, 98, 99),List(99, 102, 103))))
val rdd = sc.parallelize(collectedData, 4)

val uniqueSuffix = UUID.randomUUID()

// expected to run on Spark executors
rdd.saveAsTextFile(s"file:///tmp/just-testing/$uniqueSuffix/test3")

// expected to run on Spark driver and find NO files
println("Files on driver:")
val driverHostName = InetAddress.getLocalHost.getHostName
Files.walk(Paths.get(s"/tmp/just-testing/$uniqueSuffix/test3"))
  .toArray.map(driverHostName + " : " + _).foreach(println)

// just a *hack* to list files on every executor and get output to the driver
// PLEASE DON'T DO THAT IN PRODUCTION CODE
val outputRDD = rdd.mapPartitions[String] { _ =>
  val hostName = InetAddress.getLocalHost.getHostName
  Seq(Files.walk(Paths.get(s"/tmp/just-testing/$uniqueSuffix/test3"))
    .toArray.map(hostName + " : " + _).mkString("\n")).toIterator
}

// expected to list files as was seen on executor nodes - multiple files should be present
println("Files on executors:")
outputRDD.collect().foreach(println)

Note how the files are split between different hosts, and how the driver host dp-igork-test-m has no useful files at all, because they live on the worker nodes dp-igork-test-w-*. The output of the test job (hostnames changed for anonymity):

18/03/02 20:54:00 INFO org.spark_project.jetty.util.log: Logging initialized @1950ms

18/03/02 20:54:00 INFO org.spark_project.jetty.server.Server: jetty-9.2.z-SNAPSHOT

18/03/02 20:54:00 INFO org.spark_project.jetty.server.ServerConnector: Started ServerConnector@772485dd{HTTP/1.1}{0.0.0.0:4172}

18/03/02 20:54:00 INFO org.spark_project.jetty.server.Server: Started @2094ms

18/03/02 20:54:00 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.3-hadoop2

18/03/02 20:54:01 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at dp-igork-test-m/10.142.0.2:8032

18/03/02 20:54:03 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1520023415468_0003

18/03/02 20:54:07 WARN org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.

Files on driver:

dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3

dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/._SUCCESS.crc

dp-igork-test-m : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_SUCCESS

Files on executors:

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000003_3

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000002_2

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00002

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00003

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00003.crc

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00002.crc

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00001.crc

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00000

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000001_1

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000000_0

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00001

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00000.crc

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000003_3

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000002_2

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00002

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00003

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00003.crc

dp-igork-test-w-1 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00002.crc

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00001.crc

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00000

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000001_1

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/_temporary/0/_temporary/attempt_201803022054_0000_m_000000_0

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/part-00001

dp-igork-test-w-0 : /tmp/just-testing/50a5710f-5ff2-4145-8922-1befaf5b6740/test3/.part-00000.crc

18/03/02 20:54:12 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@772485dd{HTTP/1.1}{0.0.0.0:4172}

Answer 1 (score: 1)

You need neither to collect the rdd nor to use the PrintWriter APIs.

A simple combination of the map and mkString functions should be enough; at the end, just save the rdd to a text file with the saveAsTextFile API.

finalRDD.map(x => x._2.map("("+_.mkString(", ")+")").mkString(", ")).saveAsTextFile("path to output text file")

Your text file should then contain the following lines of text:

(97), (98), (99), (100)
(97, 98), (97, 99), (97, 101)
(97, 98, 99), (99, 102, 103)
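
If the single-quoted values from the question's desired format are required, a small tweak along the same lines (a sketch, not part of the original answer) wraps each number before joining:

// Wrap each Int in single quotes to match the ('97', '98') style requested in the question.
finalRDD
  .map(x => x._2.map(inner => "(" + inner.map(i => s"'$i'").mkString(", ") + ")").mkString(", "))
  .saveAsTextFile("path to output text file")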