I am trying to save some log data to Elasticsearch through the pipeline Flume -> Kafka -> Spark Streaming -> Elasticsearch. The step from Spark Streaming to Elasticsearch is very slow: roughly speaking, it takes about one minute to save 1000 records to Elasticsearch.
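For context, `message` is a DStream of raw log strings read from Kafka. A minimal sketch of how it is set up (the broker address, ES host, topic name and batch interval here are placeholders, not my real settings):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.elasticsearch.spark.rdd.EsSpark
import kafka.serializer.StringDecoder

val conf = new SparkConf()
  .setAppName("LogToEs")
  // elasticsearch-hadoop connection setting (placeholder host)
  .set("es.nodes", "es-host:9200")
val ssc = new StreamingContext(conf, Seconds(10))

// Direct Kafka stream; keep only the message value (the raw log line)
val kafkaParams = Map("metadata.broker.list" -> "kafka-host:9092")
val message = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("log-topic"))
  .map(_._2)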
message.foreachRDD(rdd => {
  // Step 1. Split the batch into one RDD per target ES index/type and keep them in a map
  val logRdds = new collection.mutable.HashMap[String, RDD[String]]
  logRdds += ("index1/type1" -> rdd.filter(x => isIndexAndType(x, "index1", "type1")))
  logRdds += ("index2/type2" -> rdd.filter(x => isIndexAndType(x, "index2", "type2")))
  logRdds += ("index3/type3" -> rdd.filter(x => isIndexAndType(x, "index3", "type3")))
  logRdds += ("index4/type4" -> rdd.filter(x => isIndexAndType(x, "index4", "type4")))
  logRdds += ("index5/type5" -> rdd.filter(x => isIndexAndType(x, "index5", "type5")))
  logRdds += ("index6/type6" -> rdd.filter(x => isIndexAndType(x, "index6", "type6")))
  logRdds += ("index7/type7" -> rdd.filter(x => isIndexAndType(x, "index7", "type7")))
  logRdds += ("index8/type8" -> rdd.filter(x => isIndexAndType(x, "index8", "type8")))
  logRdds += ("index9/type9" -> rdd.filter(x => isIndexAndType(x, "index9", "type9")))
  logRdds += ("index10/type10" -> rdd.filter(x => isIndexAndType(x, "index10", "type10")))
  logRdds += ("index11/type11" -> rdd.filter(x => isIndexAndType(x, "index11", "type11")))

  // Step 2. Transform each RDD: extract the log content, then keep only valid JSON
  for ((k, v) <- logRdds) {
    logRdds(k) = v.map(x => getLogContent(x))
  }
  for ((k, v) <- logRdds) {
    logRdds(k) = v.filter(x => isJson(x))
  }

  // Step 3. Save the JSON documents to Elasticsearch
  logRdds.foreach(item => {
    if (item._2.count() > 0) {
      EsSpark.saveJsonToEs(item._2, item._1)
    }
  })
})
Why does writing to Elasticsearch from Spark Streaming take so much time? Is it because of the many RDDs? How can I improve this code?
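Would something like the following single-pass version be the right direction? It is only a rough, untested sketch: `resources` is just a list of my index/type pairs, and it still writes once per resource, but it classifies each record only once, caches the result, and uses isEmpty() instead of count().

// Hypothetical rework, reusing my existing helpers isIndexAndType / getLogContent / isJson
val resources = (1 to 11).map(i => (s"index$i", s"type$i"))

message.foreachRDD(rdd => {
  // One pass over the batch: tag each record with its target "index/type" resource
  val tagged = rdd
    .flatMap { x =>
      resources.collectFirst {
        case (index, typ) if isIndexAndType(x, index, typ) =>
          (s"$index/$typ", getLogContent(x))
      }
    }
    .filter { case (_, json) => isJson(json) }
    .cache()

  // Write each resource's documents; isEmpty() avoids a full count of every sub-RDD
  for ((index, typ) <- resources) {
    val resource = s"$index/$typ"
    val docs = tagged.filter(_._1 == resource).map(_._2)
    if (!docs.isEmpty()) {
      EsSpark.saveJsonToEs(docs, resource)
    }
  }
  tagged.unpersist()
})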