Spark: making multiple API calls with mapPartitions results in java.lang.IllegalStateException: Connection pool shut down

Asked: 2019-05-06 12:22:06

Tags: scala http apache-spark

Goal: get the JSON response from a search API for every keyword present in a dataframe column.

+----------------+--------------------+
|searchKeyword   |Response            |
+----------------+--------------------+
|bags            |[{"id":"4664"}..... |
|sheet           |[{"id":"976"}.....  |
|bottles         |[{"id":"1234"}..... |
|disposable bags |[{"id":"234"}.....  |
+----------------+--------------------+

I fetched a list of keywords and turned it into a dataframe. I then make the API calls for these keywords inside mapPartitions, so that only one HTTP connection is created per partition.

However, when I run an action on the RDD, I get a "Connection pool shut down" error.

Here is the code that uses mapPartitions:

val solrUrl = "http://%s:XXXXX/solr/%s/select?q=%s&fl=id,score&defType=edismax&wt=json"

def getHttpClient(): CloseableHttpClient = {
  val httpClient: CloseableHttpClient = HttpClients.createDefault()
  httpClient
}


def getResults(url: String, httpClient: CloseableHttpClient): String = {
  val httpResponse = httpClient.execute(new HttpGet(url))
  val entity = httpResponse.getEntity()
  println(entity)
  var content = ""
  if (entity != null) {
    val inputStream = entity.getContent()
    content = scala.io.Source.fromInputStream(inputStream).getLines.mkString
    inputStream.close()
  }
  // NOTE: this shuts down the client's entire connection pool after a single request
  httpClient.getConnectionManager().shutdown()
  content
}
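Note that getResults calls httpClient.getConnectionManager().shutdown() on every invocation, which tears down the shared client's whole connection pool after the first keyword; any later execute on that same client would then fail with exactly java.lang.IllegalStateException: Connection pool shut down. For comparison, here is a minimal sketch of a variant (the name getResultsKeepPoolOpen is hypothetical) that only closes the response, returning the leased connection to the pool:

import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.CloseableHttpClient
import org.apache.http.util.EntityUtils

// hypothetical variant of getResults: releases the connection instead of shutting the pool down
def getResultsKeepPoolOpen(url: String, httpClient: CloseableHttpClient): String = {
  val httpResponse = httpClient.execute(new HttpGet(url))
  try {
    val entity = httpResponse.getEntity()
    // EntityUtils.toString fully reads and consumes the response body
    if (entity != null) EntityUtils.toString(entity) else ""
  } finally {
    httpResponse.close() // returns the connection to the pool; the pool itself stays open
  }
}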




val rddResults = searchTermsDf.rdd.mapPartitions(partition => {
  val connection = getHttpClient() // one HTTP client per partition

  val newPartition = partition.map(keyword => {
    val searchTerm = keyword.getString(0)
    // note: solrUrl contains three %s placeholders but only two arguments are supplied here
    val url = solrUrl.format(HOST_IP, searchTerm)
    getResults(url, connection)
  }).toList // consumes the iterator, so every request runs before the client is closed

  //println(newPartition)
  connection.close()
  newPartition.iterator // return a new iterator over the collected results
})

rddResults.foreach(println)
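With a getResults that no longer shuts anything down, the connection.close() at the end of the partition is the only cleanup required, and it must happen only after the iterator has been materialized. A minimal sketch of the pipeline under that assumption (COLLECTION is a hypothetical stand-in for the middle %s in solrUrl, which the format call above leaves unfilled):

val rddResults = searchTermsDf.rdd.mapPartitions(partition => {
  val connection = getHttpClient() // one client per partition
  val results = partition.map(keyword => {
    val searchTerm = keyword.getString(0)
    // COLLECTION is a hypothetical placeholder for the Solr collection name
    val url = solrUrl.format(HOST_IP, COLLECTION, searchTerm)
    getResultsKeepPoolOpen(url, connection)
  }).toList // materialize all responses before closing the client
  connection.close() // safe: every request has already completed
  results.iterator
})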

Please point out if I am doing something wrong here.

0 Answers:

There are no answers yet.