I am trying to write a collection of objects from Spark into Elasticsearch. I have to meet two requirements:
1. The documents are already serialized as JSON and should be written as-is.
2. The Elasticsearch document _id should be provided explicitly.
Here is what I have tried so far.
saveJsonToEs()
I tried to use saveJsonToEs() like this (the serialized documents contain a field _id with the desired Elasticsearch ID):
val rdd: RDD[String] = job.map{ r => r.toJson() }
val cfg = Map(
  ("es.resource", "myindex/mytype"),
  ("es.mapping.id", "_id"),
  ("es.mapping.exclude", "_id")
)
EsSpark.saveJsonToEs(rdd, cfg)
But the elasticsearch-hadoop library throws this exception:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: When writing data as JSON, the field exclusion feature is ignored. This is most likely not what the user intended. Bailing out...
at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:60)
at org.elasticsearch.hadoop.rest.InitializationUtils.validateSettings(InitializationUtils.java:253)
If I remove es.mapping.exclude but keep es.mapping.id and send the JSON with _id inside (like {"_id":"blah",...}):
val cfg = Map(
  ("es.resource", "myindex/mytype"),
  ("es.mapping.id", "_id")
)
EsSpark.saveJsonToEs(rdd, cfg)
I get this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 84.0 failed 4 times, most recent failure: Lost task 15.3 in stage 84.0 (TID 628, 172.31.35.69, executor 1): org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [172.31.30.184:9200] returned Bad Request(400) - Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters.; Bailing out..
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
...
When I try to send this ID as a different field (like {"superID":"blah",...}):
val cfg = Map(
  ("es.resource", "myindex/mytype"),
  ("es.mapping.id", "superID")
)
EsSpark.saveJsonToEs(rdd, cfg)
it fails to extract the field:
17/12/20 15:15:38 WARN TaskSetManager: Lost task 8.0 in stage 84.0 (TID 586, 172.31.33.56, executor 0): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [JsonExtractor for field [superId]] cannot extract value from entity [class java.lang.String] | instance [{...,"superID":"7f48c8ee6a8a"}]
at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:106)
at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80)
at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:161)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:107)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
When I remove es.mapping.id and es.mapping.exclude from the configuration, it works, but the document ID is generated by Elasticsearch (which violates requirement 2):
val rdd: RDD[String] = job.map{ r => r.toJson() }
val cfg = Map(
  ("es.resource", "myindex/mytype")
)
EsSpark.saveJsonToEs(rdd, cfg)
saveToEsWithMeta()
There is another function, saveToEsWithMeta(), that allows providing the _id and other metadata for insertion. It addresses requirement 2 but fails on requirement 1.
val rdd: RDD[(String, String)] = job.map{
  r => r._id -> r.toJson()
}
val cfg = Map(
  ("es.resource", "myindex/mytype")
)
EsSpark.saveToEsWithMeta(rdd, cfg)
In fact, Elasticsearch fails even to parse what elasticsearch-hadoop sends:
Caused by: org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [<es_host>:9200] returned Bad Request(400) - failed to parse; Bailing out..
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
Is it possible to write a collection of (documentID, serializedDocument) from Spark into Elasticsearch (using elasticsearch-hadoop)?
P.S. I am using Elasticsearch 5.6.3 and Spark 2.1.1.
Answer 0 (score: 1)
Have you tried something like:
import org.elasticsearch.spark._  // provides the implicit saveJsonToEs on RDD[String]

val rdd: RDD[String] = job.map{ r => r.toJson() }
val cfg = Map(
  ("es.mapping.id", "_id")
)
rdd.saveJsonToEs("myindex/mytype", cfg)
I have tested it (with elasticsearch-hadoop connector version 2.4.5 against ES 1.7) and it works.
Answer 1 (score: 1)
In the end I found the problem: it was a typo in the config.
[JsonExtractor for field [superId]] cannot extract value from entity [class java.lang.String] | instance [{...,"superID":"7f48c8ee6a8a"}]
It was looking for a field superId, but the documents only contained superID (note the case). The question is also a bit misleading in this respect, because in the code above it appears as "es.mapping.id", "superID", which was not what was actually used.
The actual solution was similar to what Levi Ramsey suggested:
import org.elasticsearch.spark.rdd.EsSpark  // spark below is assumed to be the SparkContext

val json = """{"foo":"bar","superID":"deadbeef"}"""
val rdd = spark.makeRDD(Seq(json))
val cfg = Map(
  ("es.mapping.id", "superID"),
  ("es.resource", "myindex/mytype")
)
EsSpark.saveJsonToEs(rdd, cfg = cfg)
The difference is that es.mapping.id cannot be _id (as was suggested in the original post): _id is metadata and Elasticsearch does not accept it inside the document.
Of course, this means that the new field superID has to be added to the mapping (unless the mapping is dynamic). If storing the additional field in the index is a burden, it should also be excluded from the stored document and its indexing disabled.
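For illustration only (this is not part of the original answer; the index name, host, and Elasticsearch 5.x mapping syntax are assumptions), a minimal sketch of creating the index so that superID is neither indexed nor kept in _source could look like this:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object CreateIndexSketch {
  // ES 5.x style mapping: keep "superID" out of the stored _source and do not index it.
  val body: String =
    """{
      |  "mappings": {
      |    "mytype": {
      |      "_source": { "excludes": ["superID"] },
      |      "properties": {
      |        "superID": { "type": "keyword", "index": false, "doc_values": false }
      |      }
      |    }
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // PUT the mapping when creating the index (index name and host are assumed).
    val conn = new URL("http://localhost:9200/myindex")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    println(s"create index: ${conn.getResponseCode} ${conn.getResponseMessage}")
  }
}

With such a mapping the connector can still read superID from the incoming JSON to set the document _id, while the field itself ends up neither searchable nor stored in _source.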
Many thanks to Alex Savitsky for pointing in the right direction.
Answer 2 (score: 0)
It can be done by passing the ES_INPUT_JSON option in the cfg parameter map and returning a tuple from the map function, with the document id as the first element and the document serialized as JSON as the second element.
I tested it with "org.elasticsearch" %% "elasticsearch-spark-20" % "[6.0,7.0[" against Elasticsearch 6.4:
import org.elasticsearch.hadoop.cfg.ConfigurationOptions.{ES_INPUT_JSON, ES_NODES}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

job
  .map{ r => (r._id, r.toJson()) }
  .saveToEsWithMeta(
    "myindex/mytype",
    Map(
      ES_NODES -> "https://localhost:9200",
      ES_INPUT_JSON -> true.toString
    )
  )
Answer 3 (score: 0)
I spent days banging my head against the wall trying to figure out why saveToEsWithMeta would not work when I used a string for the ID, like so:
rdd.map(caseClassContainingJson =>
  (caseClassContainingJson._idWhichIsAString, caseClassContainingJson.jsonString)
)
.saveToEsWithMeta(s"$nationalShapeIndexName/$nationalShapeIndexType", Map(
  ES_INPUT_JSON -> true.toString
))
This throws errors related to JSON parsing, deceiving you into thinking the problem is with your JSON, but then you log each document and find that they are all valid.
It turns out that, for whatever reason, ES_INPUT_JSON -> true makes the left-hand side of the tuple, i.e. the ID, get parsed as JSON as well!
The solution is to JSON-stringify the ID (wrapping it in extra double quotes) so that parsing it as JSON works:
import play.api.libs.json.{JsString, Json}  // Json.stringify / JsString are assumed to come from play-json

rdd.map(caseClassContainingJson =>
  (
    Json.stringify(JsString(caseClassContainingJson._idWhichIsAString)),
    caseClassContainingJson.jsonString
  )
)
.saveToEsWithMeta(s"$nationalShapeIndexName/$nationalShapeIndexType", Map(
  ES_INPUT_JSON -> true.toString
))
Answer 4 (score: 0)
saveToEs can be used to take the document id from a customer_id field without having to store customer_id in the document itself. The RDD has to be of RDD[Map] type:

val rdd: RDD[Map[String, Any]] = ...
val cfg = Map(
  ("es.mapping.id", your_customer_id),
  ("es.mapping.exclude", your_customer_id)
)
EsSpark.saveToEs(rdd, your_es_index, cfg)
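As an illustration only (the field names, index name, and local Spark master are assumptions, not taken from the answer above), a self-contained sketch of this approach could look like the following; es.mapping.id takes the document _id from customer_id, and es.mapping.exclude keeps that field out of the stored document:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

case class Customer(id: String, name: String)

object SaveToEsSketch {
  def main(args: Array[String]): Unit = {
    // "es.nodes" would normally be configured as well; a local cluster is assumed here.
    val sc = new SparkContext(new SparkConf().setAppName("es-sketch").setMaster("local[*]"))

    val customers = Seq(Customer("7f48c8ee6a8a", "Alice"), Customer("deadbeef", "Bob"))

    // Each document is a Map; "customer_id" exists only so that the connector
    // can use it as the Elasticsearch _id.
    val rdd = sc.makeRDD(customers).map { c =>
      Map("customer_id" -> c.id, "name" -> c.name)
    }

    val cfg = Map(
      ("es.mapping.id", "customer_id"),      // use customer_id as the document _id
      ("es.mapping.exclude", "customer_id")  // do not store customer_id in the document
    )
    EsSpark.saveToEs(rdd, "myindex/mytype", cfg)

    sc.stop()
  }
}

Note that, unlike the JSON-based approaches above, the documents here are Maps rather than pre-serialized JSON strings, so this approach sidesteps requirement 1 from the question.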