I have an object that needs to do a few things: call two external services, turn the responses into RDDs, and join the results.
Now I'm trying to do this asynchronously, but it doesn't seem to work. My guess is that Spark doesn't see the calls to .parallelize, because they are made in different tasks (or Futures), so the code runs before/after it should, or perhaps in a context that was never set up (could that be the case?). I've tried different approaches; one of them was calling SparkEnv.set inside the map and flatMap calls (i.e., inside the Futures). However, all I got was "Cannot call methods on a stopped SparkContext". It just doesn't work; maybe I simply misunderstood what SparkEnv.set does, so I removed it.
Here is the code I have written so far:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object Fetcher {
  // externalService1/externalService2 (and the WSResponse type, from the
  // Play WS client) are defined elsewhere
  def fetch(name: String, master: String /*, ... */) = {
    // both external calls start right away and run asynchronously
    val externalCallOne: Future[WSResponse] = externalService1()
    val externalCallTwo: Future[String] = externalService2()
    // val sparkEnv = SparkEnv.get
    val config = new SparkConf()
      .setAppName(name)
      .set("spark.master", master)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sparkContext = new SparkContext(config)
    // val sparkEnv = SparkEnv.get
    val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
      // SparkEnv.set(sparkEnv)
      externalCallTwo map { dataTwo =>
        println("in map") // prints, so it gets here ...
        val rddOne = sparkContext.parallelize(dataOne)
        val rddTwo = sparkContext.parallelize(dataTwo)
        // do stuff here ... foreach/println, and
        val joinedData = rddOne leftOuterJoin (rddTwo)
      }
    }
    eventuallyJoinedData onSuccess { case success => ... }
    eventuallyJoinedData onFailure { case error => println(error.getMessage) }
    // sparkContext.stop
  }
}
As you can see, I've also tried commenting out the line that stops the context, but then I got another issue:
13:09:14.929 [ForkJoinPool-1-worker-5] INFO org.apache.spark.SparkContext - Starting job: count at Fetcher.scala:38
13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner - Error in cleaning thread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) ~[na:1.8.0_65]
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157) ~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136) [spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154) [spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67) [spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.940 [db-async-netty-thread-1] DEBUG io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils - uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[na:1.8.0_65]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[na:1.8.0_65]
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312) ~[na:1.8.0_65]
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65) ~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136) ~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) [spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping org.spark-project.jetty.server.Server@787cbcef
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping SelectChannelConnector@0.0.0.0:4040
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
As you can see, it tries to run the count action on an RDD, but then it fails (possibly because the SparkContext is null(?)).
How can I fix this? What needs to change? Do I need to switch to a synchronous architecture?
I'm using Spark 1.5.1 with SBT and Scala 2.10.6.
Answer 0 (score: 0)
I got several answers on the Spark mailing list; you can read the full discussion here.
It seems that it isn't possible to use Futures with Spark unless you do some magic like the Ooyala server does. However, rather than ending up with complex code, it would be better to move to a different architecture (using Kafka/Flume/...).
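For illustration, here is a minimal sketch of what such a synchronous alternative could look like. This is my own reconstruction under stated assumptions, not code from the mailing-list thread: the externalService1/externalService2 stubs and the Seq[(String, Int)] payload type are hypothetical placeholders. The idea is to block on the external calls with Await.result before handing anything to Spark, and to stop the context only after every job has run.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object SyncFetcher {
  def fetch(name: String, master: String): Unit = {
    // the two external calls still run concurrently with each other ...
    val callOne: Future[Seq[(String, Int)]] = externalService1()
    val callTwo: Future[Seq[(String, Int)]] = externalService2()

    val sparkContext = new SparkContext(
      new SparkConf().setAppName(name).setMaster(master))
    try {
      // ... but we block before handing the data to Spark, so every
      // parallelize/join/count runs while the context is still alive
      val dataOne = Await.result(callOne, 30.seconds)
      val dataTwo = Await.result(callTwo, 30.seconds)

      val rddOne = sparkContext.parallelize(dataOne)
      val rddTwo = sparkContext.parallelize(dataTwo)
      val joined = rddOne leftOuterJoin rddTwo
      println(joined.count()) // the action completes before stop() below
    } finally {
      // stop the context only after all Spark work is done
      sparkContext.stop()
    }
  }

  // hypothetical stand-ins for the real external services
  def externalService1(): Future[Seq[(String, Int)]] = Future(Seq("a" -> 1))
  def externalService2(): Future[Seq[(String, Int)]] = Future(Seq("a" -> 2))
}

The key design point is ordering: by the time sparkContext.stop() runs in the finally block, every action has already finished, which avoids both the "Cannot call methods on a stopped SparkContext" error and the InterruptedException seen in the log above. If you want to keep the Futures, the same ordering could presumably be achieved by stopping the context in an onComplete callback on the joined Future rather than at the end of the method.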