Calling SparkContext methods inside a Scala Future

Date: 2016-01-18 13:27:40

Tags: scala apache-spark

I have an object that needs to do several things:

  1. Call external service one (a web API)
  2. Call external service two (another API)
  3. Read from HDFS (via Spark) and produce an RDD
  4. Parallelize the data obtained from the first two calls
  5. Join these different RDDs and do something with them...

Now, I am trying to do this asynchronously, but it doesn't seem to work. My guess is that Spark doesn't see the calls to `.parallelize` because they are made in different tasks (or Futures), so this code runs before/after, or perhaps with an unset context (is that possible?). I tried different approaches, one of them being a call to `SparkEnv.set` inside the `flatMap` and `map` (inside the Future). However, all I got was "Cannot call methods on a stopped SparkContext". It just doesn't work; maybe I simply misunderstood what it does, so I removed it.

    Here is the code I have written so far:

    object Fetcher {

      def fetch(name: String, master: String, ...) = {
        val externalCallOne: Future[WSResponse] = externalService1()
        val externalCallTwo: Future[String] = externalService2()
        // val sparkEnv = SparkEnv.get
        val config = new SparkConf()
          .setAppName(name)
          .set("spark.master", master)
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

        val sparkContext = new SparkContext(config)
        // val sparkEnv = SparkEnv.get

        val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
          // SparkEnv.set(sparkEnv)
          externalCallTwo map { dataTwo =>
            println("in map") // prints, so it gets here ...
            val rddOne = sparkContext.parallelize(dataOne)
            val rddTwo = sparkContext.parallelize(dataTwo)
            // do stuff here ... foreach/println, and

            val joinedData = rddOne leftOuterJoin (rddTwo)
          }
        }
        eventuallyJoinedData onSuccess { case success => ... }
        eventuallyJoinedData onFailure { case error => println(error.getMessage) }
        // sparkContext.stop
      }

    }
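
Not part of the original question, but one way to avoid the race between the Futures and the context's lifetime is to resolve both external calls on the driver thread with `Await.result` before touching the `SparkContext`, so that `parallelize` and the join run synchronously. A minimal sketch, assuming the services return simple key-value sequences (`externalService1`, `externalService2`, the pair types, and the timeout are all placeholders):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object FetcherSync {
  // Hypothetical stand-ins for the two external services.
  def externalService1(): Future[Seq[(String, Int)]] = Future(Seq(("a", 1)))
  def externalService2(): Future[Seq[(String, Int)]] = Future(Seq(("a", 2)))

  def fetch(name: String, master: String): Unit = {
    val config = new SparkConf()
      .setAppName(name)
      .set("spark.master", master)

    val sparkContext = new SparkContext(config)
    try {
      // Block on the driver thread until both external calls complete.
      val dataOne = Await.result(externalService1(), 30.seconds)
      val dataTwo = Await.result(externalService2(), 30.seconds)

      val rddOne = sparkContext.parallelize(dataOne)
      val rddTwo = sparkContext.parallelize(dataTwo)

      val joinedData = rddOne.leftOuterJoin(rddTwo)
      joinedData.collect().foreach(println)
    } finally {
      // Only stop the context once all jobs that use it have finished.
      sparkContext.stop()
    }
  }
}
```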
    

    As you can see, I also tried commenting out the line that stops the context, but then another problem appeared:

    13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext - Starting job: count at Fetcher.scala:38
    13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
    13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner - Error in cleaning thread
    java.lang.InterruptedException: null
        at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) ~[na:1.8.0_65]
        at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157) ~[spark-core_2.10-1.5.1.jar:1.5.1]
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136) [spark-core_2.10-1.5.1.jar:1.5.1]
        at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154) [spark-core_2.10-1.5.1.jar:1.5.1]
        at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67) [spark-core_2.10-1.5.1.jar:1.5.1]
    13:09:14.940 [db-async-netty-thread-1] DEBUG io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely because Thread.currentThread().interrupt() was called. Use NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
    13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils - uncaught error in thread SparkListenerBus, stopping SparkContext
    java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[na:1.8.0_65]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[na:1.8.0_65]
        at java.util.concurrent.Semaphore.acquire(Semaphore.java:312) ~[na:1.8.0_65]
        at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65) ~[spark-core_2.10-1.5.1.jar:1.5.1]
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136) ~[spark-core_2.10-1.5.1.jar:1.5.1]
        at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) [spark-core_2.10-1.5.1.jar:1.5.1]
    13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping org.spark-project.jetty.server.Server@787cbcef
    13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping SelectChannelConnector@0.0.0.0:4040
    13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle - stopping org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
    

    As you can see, it tries to run a count action on the RDD, but then it fails (perhaps because the SparkContext is null(?)).

    How can I solve this? What needs to be done? Do I need to switch to a synchronous architecture?

    I am using Spark 1.5.1 with SBT and Scala 2.10.6.

1 answer:

Answer 0 (score: 0)

I got several answers on the Spark mailing list. You can read the full discussion here.

It seems it is not possible to use Futures with Spark unless you do some magic like the Ooyala server does. However, to avoid overly complex code, it would be better to use a different architecture (with Kafka/Flume/...).
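
For illustration, the mailing-list advice boils down to making the SparkContext outlive any Future that uses it. A sketch (my own, not from the discussion) that keeps the asynchronous style from the question but blocks before stopping the context; it reuses `externalCallOne`, `externalCallTwo`, and `sparkContext` from the question's code and assumes both calls yield collections Spark can parallelize:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Compose the two external calls; the Spark work happens inside the Future,
// but the driver waits for the result before tearing anything down.
val eventuallyCount: Future[Long] =
  for {
    dataOne <- externalCallOne
    dataTwo <- externalCallTwo
  } yield {
    val rddOne = sparkContext.parallelize(dataOne)
    val rddTwo = sparkContext.parallelize(dataTwo)
    rddOne.leftOuterJoin(rddTwo).count()
  }

// Block until the job finishes; only then is it safe to stop the context.
val count = Await.result(eventuallyCount, 10.minutes)
sparkContext.stop()
```

The timeout value is arbitrary; the essential point is that `sparkContext.stop()` must not run while a job submitted from inside a Future is still in flight, which is what the `InterruptedException` stack traces above suggest was happening.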