Spark httpclient job does not parallelize well

Time: 2018-04-04 01:41:04

Tags: scala http apache-spark

I'm new to Scala/Spark. I need to write a Spark job that calls an API for every URL in an input urls.txt file. Below is my sample code. I suspect that the Await below is what limits the job, but with my limited experience I haven't been able to come up with a better way to achieve this. Any help is much appreciated.

import java.util.concurrent.TimeoutException

import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object MyApp extends App {

  // `sc` is the SparkContext; InitializeConfigurations.getWSClient() is my own helper
  val partitions = 10
  val textFile = sc.textFile(s"file:///tmp/urls.txt", partitions)

  val futures = textFile.flatMap { url =>
    // a new WS client is built for every URL
    val wsclient = InitializeConfigurations.getWSClient()
    val future = wsclient.url(url).withRequestTimeout(10.seconds).get().map { response =>
      s"${response.body}"
    }

    future onComplete {
      case Success(res) =>
        println(s"oncomplete: res = $res")
      case Failure(ex) =>
        ex match {
          case t: TimeoutException => // ignore timeouts
          case _ => ex.printStackTrace()
        }
    }

    Some(future)
  }

  // block on each future, one record at a time
  val hresps = futures.flatMap { f =>
    try {
      Some(Await.result(f, 10.seconds))
    } catch {
      case e: Exception => None
    }
  }

  hresps.saveAsTextFile(s"file:///tmp/a01.txt")
}
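One idea I've been considering, as an untested sketch only: move the work into mapPartitions, build one WS client per partition, start all the requests in that partition, and block a single time on Future.sequence instead of once per URL. It reuses `sc` and my InitializeConfigurations.getWSClient() helper exactly as in the code above; the 10-minute partition-level timeout is just a placeholder I made up.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Untested sketch: one client per partition, fire all requests, block once per partition.
// `sc` and InitializeConfigurations.getWSClient() are the same as in the code above.
val results = sc.textFile("file:///tmp/urls.txt", 10).mapPartitions { urls =>
  val wsclient = InitializeConfigurations.getWSClient()  // one client per partition

  // start every request in this partition before awaiting any of them
  val futures = urls.toVector.map { url =>
    wsclient.url(url)
      .withRequestTimeout(10.seconds)
      .get()
      .map(response => Option(response.body))
      .recover { case _ => None }                        // drop failed/timed-out URLs
  }

  // a single blocking point per partition (10 minutes is an arbitrary upper bound)
  Await.result(Future.sequence(futures), 10.minutes).flatten.iterator
}

results.saveAsTextFile("file:///tmp/a01.txt")

I'm not sure this is the right approach, and I also don't know where the client should be closed in this version, so any guidance on that would help too.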

0 Answers:

No answers