Elasticsearch + Spark: writing JSON with a custom document _id

Date: 2017-12-19 17:58:39

Tags: scala apache-spark elasticsearch elasticsearch-hadoop

I am trying to write a collection of objects from Spark to Elasticsearch. I have to meet two requirements:

  1. The documents are already serialized to JSON and should be written as-is
  2. The Elasticsearch document _id should be supplied

Here is what I have tried so far (a minimal sketch of the record shape I am working with is shown below).
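
For context, this is a minimal, hypothetical stand-in for the records produced by `job` — the names `Record`, `_id`, and `toJson()` are illustrative assumptions inferred from the snippets that follow, not the actual classes:

    case class Record(_id: String, foo: String) {
      // Hand-rolled serialization just to keep the sketch self-contained;
      // in practice the documents are assumed to be pre-serialized JSON strings.
      def toJson(): String = s"""{"_id":"${_id}","foo":"$foo"}"""
    }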

saveJsonToEs()

I tried to use saveJsonToEs() like this (the serialized documents contain a field _id with the desired Elasticsearch ID):

    val rdd: RDD[String] = job.map{ r => r.toJson() }
    
    val cfg = Map(
      ("es.resource", "myindex/mytype"),
      ("es.mapping.id", "_id"),
      ("es.mapping.exclude", "_id")
    )
    
    EsSpark.saveJsonToEs(rdd, cfg)
    

But the elasticsearch-hadoop library throws this exception:

    Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: When writing data as JSON, the field exclusion feature is ignored. This is most likely not what the user intended. Bailing out...
        at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:60)
        at org.elasticsearch.hadoop.rest.InitializationUtils.validateSettings(InitializationUtils.java:253)
    

If I remove es.mapping.exclude but keep es.mapping.id and send the JSON with _id inside (like {"_id":"blah",...}):

    val cfg = Map(
      ("es.resource", "myindex/mytype"),
      ("es.mapping.id", "_id")
    )
    
    EsSpark.saveJsonToEs(rdd, cfg)
    

I get this error:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 84.0 failed 4 times, most recent failure: Lost task 15.3 in stage 84.0 (TID 628, 172.31.35.69, executor 1): org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [172.31.30.184:9200] returned Bad Request(400) - Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters.; Bailing out..
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
    ...
    

When I try to send this ID as a different field (like {"superID":"blah",...}):

     val cfg = Map(
      ("es.resource", "myindex/mytype"),
      ("es.mapping.id", "superID")
    )
    
    EsSpark.saveJsonToEs(rdd, cfg)
    

it is unable to extract the field:

    17/12/20 15:15:38 WARN TaskSetManager: Lost task 8.0 in stage 84.0 (TID 586, 172.31.33.56, executor 0): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [JsonExtractor for field [superId]] cannot extract value from entity [class java.lang.String] | instance [{...,"superID":"7f48c8ee6a8a"}]
        at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:106)
        at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80)
        at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56)
        at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:161)
        at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
        at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
        at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    

When I remove es.mapping.id and es.mapping.exclude from the configuration, it works, but the document ID is generated by Elasticsearch (which violates requirement 2):

    val rdd: RDD[String] = job.map{ r => r.toJson() }
    
    val cfg = Map(
      ("es.resource", "myindex/mytype")
    )
    
    EsSpark.saveJsonToEs(rdd, cfg)
    

saveToEsWithMeta()

There is another function, saveToEsWithMeta(), that allows providing the _id and other metadata for insertion. It solves requirement 2 but fails requirement 1.

    val rdd: RDD[(String, String)] = job.map{
      r => r._id -> r.toJson()
    }
    
    val cfg = Map(
      ("es.resource", "myindex/mytype")
    )
    
    EsSpark.saveToEsWithMeta(rdd, cfg)
    

In fact, Elasticsearch is not even able to parse what elasticsearch-hadoop sends:

    Caused by: org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [<es_host>:9200] returned Bad Request(400) - failed to parse; Bailing out..
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
    

The question

Is it possible to write a collection of (documentID, serializedDocument) from Spark into Elasticsearch (using elasticsearch-hadoop)?

P.S. I am using Elasticsearch 5.6.3 and Spark 2.1.1.

5 Answers:

Answer 0 (score: 1)

Have you tried something like:

import org.elasticsearch.spark._  // brings the implicit saveJsonToEs method on RDD into scope

val rdd: RDD[String] = job.map{ r => r.toJson() }
val cfg = Map(
  ("es.mapping.id", "_id")
)
rdd.saveJsonToEs("myindex/mytype", cfg)

I have tested it (with elasticsearch-hadoop connector version 2.4.5 against ES 1.7) and it works.

Answer 1 (score: 1)

In the end I found the problem: it was a typo in the config.

[JsonExtractor for field [superId]] cannot extract value from entity [class java.lang.String] | instance [{...,"superID":"7f48c8ee6a8a"}]

It was looking for a field superId, but there was only superID (mind the case). In the question it is also a bit misleading, since in the code it appears as "es.mapping.id", "superID" (which was not what was actually used).

The actual solution is similar to what Levi Ramsey suggested:

val json = """{"foo":"bar","superID":"deadbeef"}"""

val rdd = spark.makeRDD(Seq(json))
val cfg = Map(
  ("es.mapping.id", "superID"),
  ("es.resource", "myindex/mytype")
)
EsSpark.saveJsonToEs(rdd, cfg = cfg)

The difference is that es.mapping.id cannot be _id (as indicated in the original post, _id is metadata and Elasticsearch does not accept it).

Naturally, this means the new field superID should be added to the mapping (unless the mapping is dynamic). If storing the additional field in the index is a burden, one should also:

  • exclude it from the mapping
  • and disable its indexing (see the mapping sketch below)
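
A hedged sketch of what such a mapping could look like on ES 5.x — the index, type, and field names are just the placeholders used above, and the mapping itself is plain Elasticsearch JSON held in a Scala string for convenience:

// Hypothetical mapping for "myindex/mytype" (ES 5.x syntax): superID is kept
// out of _source and not indexed, so it only serves as the document _id.
val superIdMapping: String =
  """{
    |  "mappings": {
    |    "mytype": {
    |      "_source": { "excludes": ["superID"] },
    |      "properties": {
    |        "superID": { "type": "keyword", "index": false },
    |        "foo": { "type": "text" }
    |      }
    |    }
    |  }
    |}""".stripMargin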

Many thanks to Alex Savitsky for pointing in the right direction.

Answer 2 (score: 0)

It can be done by passing the ES_INPUT_JSON option in the cfg parameter map and returning, from the map function, a tuple containing the document id as the first element and the JSON-serialized document as the second element.

I tested it with "org.elasticsearch" %% "elasticsearch-spark-20" % "[6.0,7.0[" against Elasticsearch 6.4.
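
In build.sbt terms that dependency would look roughly like this (the version range is the Ivy-style range quoted above):

// sbt dependency for the elasticsearch-hadoop Spark 2.x connector
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "[6.0,7.0["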

import org.elasticsearch.hadoop.cfg.ConfigurationOptions.{ES_INPUT_JSON, ES_NODES}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

job
  .map{ r => (r._id, r.toJson()) }
  .saveToEsWithMeta(
    "myindex/mytype",
    Map(
      ES_NODES -> "https://localhost:9200",
      ES_INPUT_JSON -> true.toString
    )
  )

Answer 3 (score: 0)

I spent days banging my head against the wall trying to figure out why saveToEsWithMeta would not work when I used a string for the ID, like so:

rdd.map(caseClassContainingJson =>
  (caseClassContainingJson._idWhichIsAString, caseClassContainingJson.jsonString)
)
.saveToEsWithMeta(s"$nationalShapeIndexName/$nationalShapeIndexType", Map(
  ES_INPUT_JSON -> true.toString
))

This throws errors related to JSON parsing, which deceptively makes you think the problem is with your JSON, but then you log each individual document and see that they are all valid.

It turns out that, for whatever reason, ES_INPUT_JSON -> true makes the left-hand side of the tuple (i.e. the ID) get parsed as JSON too!

The solution is to JSON-stringify the ID (wrapping it in extra double quotes) so that it parses as valid JSON:

import org.elasticsearch.hadoop.cfg.ConfigurationOptions.ES_INPUT_JSON
import org.elasticsearch.spark._
import play.api.libs.json.{JsString, Json}  // assuming Play JSON, based on Json.stringify / JsString

rdd.map(caseClassContainingJson =>
  (
    Json.stringify(JsString(caseClassContainingJson._idWhichIsAString)),
    caseClassContainingJson.jsonString
  )
)
.saveToEsWithMeta(s"$nationalShapeIndexName/$nationalShapeIndexType", Map(
  ES_INPUT_JSON -> true.toString
))
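
A minimal, hedged alternative if you would rather not pull in a JSON library, assuming the IDs never contain characters that need JSON escaping (otherwise stick with a proper JSON stringifier as above):

// Wrap the raw ID in literal double quotes so the connector's JSON parser
// sees a valid JSON string. No escaping is performed, hence the assumption
// that IDs are plain values like "7f48c8ee6a8a".
def quoteId(id: String): String = "\"" + id + "\""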

Answer 4 (score: 0)

  1. You can use saveToEs to define customer_id as the document ID without saving customer_id in the document itself
  2. Note that the rdd is of type RDD[Map] (a concrete usage sketch follows the snippet below)
val rdd:RDD[Map[String, Any]]=...
val cfg = Map(
  ("es.mapping.id", your_customer_id),
  ("es.mapping.exclude", your_customer_id)
)
EsSpark.saveToEs(rdd, your_es_index, cfg)
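
For instance, a hedged, concrete version of the snippet above — the index name, field name, and sample data are placeholders invented for illustration:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark.rdd.EsSpark

// "customer_id" is used as the Elasticsearch _id but excluded from the
// indexed document body via es.mapping.exclude.
def saveCustomers(sc: SparkContext): Unit = {
  val rdd: RDD[Map[String, Any]] = sc.makeRDD(Seq(
    Map[String, Any]("customer_id" -> "42", "name" -> "Alice", "age" -> 31),
    Map[String, Any]("customer_id" -> "43", "name" -> "Bob", "age" -> 27)
  ))
  val cfg = Map(
    "es.mapping.id" -> "customer_id",
    "es.mapping.exclude" -> "customer_id"
  )
  EsSpark.saveToEs(rdd, "customers/doc", cfg)
}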