Question

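I am running a Spark job, and at some point I want to connect to an Elasticsearch server to fetch some data and add it to an RDD. The code I am using looks like this: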

 input.mapPartitions(records => {
   // One connection per partition
   val elcon = new ElasticSearchConnection
   val client: TransportClient = elcon.openConnection()
   val newRecs = records.flatMap(record => {
     // Look up the document whose id matches the record
     val response = client.prepareGet("index", "indexType",
       record.id.toString).execute().actionGet()
     val newRec = processRec(record, response)
     newRec
   }) // end of flatMap
   client.close()
   newRecs
 }) // end of mapPartitions

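My problem is that client.close() is called before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that creates a huge number of connections. Is it possible to make sure that client.close() is called only after the flatMap operation has finished?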

Answer 1

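Making a blocking call for each item in your RDD to fetch the corresponding ElasticSearch document is what causes the problem. It is generally advisable to avoid blocking calls.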
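
As for the exception itself: the iterator handed to mapPartitions is lazy, so flatMap only builds another iterator and the lookups actually run later, when Spark consumes it, by which time client.close() has already executed. The simplest fix is to materialize the results (for example with .toList.iterator) before closing, at the cost of holding the whole partition's output in memory. Below is a minimal sketch of a lazier alternative, written in plain Scala with no Spark or Elasticsearch dependency. FakeClient, ClosingIterator and enrich are hypothetical names made up for this illustration, not part of any library.

```scala
object LazyCloseDemo {
  // Hypothetical stand-in for the Elasticsearch client: it only tracks open/closed state.
  final class FakeClient {
    var closed = false
    def get(id: Int): String = {
      if (closed) throw new IllegalStateException("client already closed")
      s"doc-$id"
    }
    def close(): Unit = { closed = true }
  }

  // Wraps an iterator and runs onDone once, the first time hasNext
  // reports that no elements remain.
  final class ClosingIterator[A](underlying: Iterator[A], onDone: () => Unit)
      extends Iterator[A] {
    private var done = false
    override def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !done) { done = true; onDone() }
      more
    }
    override def next(): A = underlying.next()
  }

  // Shape of the function you would pass to mapPartitions (assumption: one lookup per record).
  def enrich(records: Iterator[Int], client: FakeClient): Iterator[String] = {
    val looked = records.map(id => client.get(id)) // lazy: no lookup has run yet
    new ClosingIterator(looked, () => client.close())
  }
}
```

With this wrapper the client stays open while the records are being pulled and is closed when the last one has been consumed. It does not cover a consumer that stops early; inside a real Spark task, registering the close through TaskContext.get().addTaskCompletionListener is another option that should handle that case as well.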

There is an alternative approach that uses ElasticSearch-for-Hadoop's Spark support.

Read the ElasticSearch index/type as another RDD and join it with your RDD.

Include the right version of the ESHadoop dependency.

import org.elasticsearch.spark._

// Returns a pair RDD of (_id, Map of all key/value pairs for all fields)
val esRdd = sc.esRDD("index/indexType")
// Convert your RDD of records to a pair RDD of (id, record), as we want to join on the id (the _id key in esRdd is a String)
val keyed = input.map(record => (record.id.toString, record))
// Join the two RDDs on id; the result is a pair RDD of (id, (inputRddRecord, esRddRecord))
keyed.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2))
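
One thing to keep in mind is that join on pair RDDs is an inner join, so input records whose id has no matching document are dropped, whereas leftOuterJoin keeps them with a None on the ES side. Here is a small sketch of that difference using plain Scala collections in place of RDDs; the sample data and the JoinSketch name are made up for illustration.

```scala
object JoinSketch {
  // Hypothetical sample data: input records keyed by id, ES documents keyed by _id.
  val input: Seq[(String, String)] = Seq("1" -> "recA", "2" -> "recB", "3" -> "recC")
  val esDocs: Map[String, Map[String, AnyRef]] = Map(
    "1" -> Map("title" -> "first"),
    "3" -> Map("title" -> "third")
  )

  // Inner join: keeps only ids present on both sides (the behaviour of RDD join).
  val inner: Seq[(String, (String, Map[String, AnyRef]))] =
    for {
      (id, rec) <- input
      doc <- esDocs.get(id).toSeq
    } yield (id, (rec, doc))

  // Left outer join: keeps every input record, with None where no document matches
  // (the behaviour of RDD leftOuterJoin).
  val leftOuter: Seq[(String, (String, Option[Map[String, AnyRef]]))] =
    input.map { case (id, rec) => (id, (rec, esDocs.get(id))) }
}
```

In this sample, the record with id "2" disappears from the inner join and shows up with None in the left outer join.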

Hope this helps.

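PS: This still does not alleviate the lack of co-location (i.e. each document with a given _id may come from a different shard of the index). A better approach would be to achieve co-location of the input RDD and the ES index documents at the time the ES index is created.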

Spark, mapPartitions, network connection is closed before the map operation is finished
