从Cassandra写到json后,Spark提交挂起

时间:2016-11-04 18:16:46

标签: scala apache-spark amazon-s3 spark-cassandra-connector

我有一个驱动程序,我使用spark从Cassandra写入读取数据,执行一些操作,然后在S3上写出JSON。当我使用Spark 1.6.1和spark-cassandra-connector 1.6.0-M1时,程序运行正常。

但是,如果我尝试升级到Spark 2.0.1(hadoop 2.7.1)和spark-cassandra-connector 2.0.0-M3,程序就会在所有预期文件都写入S3的意义上完成,但是该程序永远不会终止。

我在程序结束时运行sc.stop()。我也在使用Mesos 1.0.1。在这两种情况下,我都使用默认输出提交器。

编辑:查看下面的线程转储,似乎可以等待:org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner

代码段:

// get MongoDB oplog operations
val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
  .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)

// replay oplog operations into documents
val documents = operations
  .spanBy(op => op.id)
  .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
  .filter { case (id, result) => result.isInstanceOf[Document] }
  .map { case (id, document) => MergedDocument(id = id, document = document
    .asInstanceOf[Document])
  }

// write documents to json on s3
documents
  .map(document => document.toJson)
  .coalesce(partitions)
  .saveAsTextFile(path, classOf[GzipCodec])
sc.stop()

驱动程序上的线程转储:

    60  context-cleaner-periodic-gc TIMED_WAITING
46  dag-scheduler-event-loop    WAITING
4389    DestroyJavaVM   RUNNABLE
12  dispatcher-event-loop-0 WAITING
13  dispatcher-event-loop-1 WAITING
14  dispatcher-event-loop-2 WAITING
15  dispatcher-event-loop-3 WAITING
47  driver-revive-thread    TIMED_WAITING
3   Finalizer   WAITING
82  ForkJoinPool-1-worker-17    WAITING
43  heartbeat-receiver-event-loop-thread    TIMED_WAITING
93  java-sdk-http-connection-reaper TIMED_WAITING
4387    java-sdk-progress-listener-callback-thread  WAITING
25  map-output-dispatcher-0 WAITING
26  map-output-dispatcher-1 WAITING
27  map-output-dispatcher-2 WAITING
28  map-output-dispatcher-3 WAITING
29  map-output-dispatcher-4 WAITING
30  map-output-dispatcher-5 WAITING
31  map-output-dispatcher-6 WAITING
32  map-output-dispatcher-7 WAITING
48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
44  netty-rpc-env-timeout   TIMED_WAITING
92  org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   WAITING
62  pool-19-thread-1    TIMED_WAITING
2   Reference Handler   WAITING
61  Scheduler-1112394071    TIMED_WAITING
20  shuffle-server-0    RUNNABLE
55  shuffle-server-0    RUNNABLE
21  shuffle-server-1    RUNNABLE
56  shuffle-server-1    RUNNABLE
22  shuffle-server-2    RUNNABLE
57  shuffle-server-2    RUNNABLE
23  shuffle-server-3    RUNNABLE
58  shuffle-server-3    RUNNABLE
4   Signal Dispatcher   RUNNABLE
59  Spark Context Cleaner   TIMED_WAITING
9   SparkListenerBus    WAITING
35  SparkUI-35-selector-ServerConnectorManager@651d3734/0   RUNNABLE
36  SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} RUNNABLE
37  SparkUI-37-selector-ServerConnectorManager@651d3734/1   RUNNABLE
38  SparkUI-38  TIMED_WAITING
39  SparkUI-39  TIMED_WAITING
40  SparkUI-40  TIMED_WAITING
41  SparkUI-41  RUNNABLE
42  SparkUI-42  TIMED_WAITING
438 task-result-getter-0    WAITING
450 task-result-getter-1    WAITING
489 task-result-getter-2    WAITING
492 task-result-getter-3    WAITING
75  threadDeathWatcher-2-1  TIMED_WAITING
45  Timer-0 WAITING

执行程序上的线程转储。所有这些都是一样的:

24  dispatcher-event-loop-0 WAITING
25  dispatcher-event-loop-1 WAITING
26  dispatcher-event-loop-2 RUNNABLE
27  dispatcher-event-loop-3 WAITING
39  driver-heartbeater  TIMED_WAITING
3   Finalizer   WAITING
58  java-sdk-http-connection-reaper TIMED_WAITING
75  java-sdk-progress-listener-callback-thread  WAITING
1   main    TIMED_WAITING
33  netty-rpc-env-timeout   TIMED_WAITING
55  org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   WAITING
59  pool-17-thread-1    TIMED_WAITING
2   Reference Handler   WAITING
28  shuffle-client-0    RUNNABLE
35  shuffle-client-0    RUNNABLE
41  shuffle-client-0    RUNNABLE
37  shuffle-server-0    RUNNABLE
5   Signal Dispatcher   RUNNABLE
23  threadDeathWatcher-2-1  TIMED_WAITING

1 个答案:

答案 0 :(得分:0)

我通过在程序jar中更新以下包来解决这个问题:

  • spark 2.0.0 to 2.0.1
  • json4s 3.2.11至3.5.0
  • 扇贝2.0.1至2.0.5
  • nscala-time 1.8.0至2.14.0