I have data in Elasticsearch that I want to use with Spark. The problem is that my Elasticsearch documents contain array-typed fields.
Here is a sample of my Elasticsearch data:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 36,
    "successful": 36,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2586638,
    "max_score": 1,
    "hits": [
      {
        "_index": "Index_Name",
        "_type": "Type_Name",
        "_id": "l-hplmIBgpUzwNjPutjY",
        "_score": 1,
        "_source": {
          "currentTime": 1518339120000,
          "location": {
            "lat": 25.13,
            "lon": 55.18
          },
          "radius": 65.935,
          "myArray": [
            {
              "id": "1154",
              "field2": 8,
              "field3": 16.39,
              "myInnerArray": [
                [55.18, 25.13],
                [55.18, 25.13],
                ...
              ]
            }
          ],
          "field4": 0.512,
          "field5": 123.47,
          "time": "2018-02-11T08:52:00+0000"
        }
      },
      {
        "_index": "Index_Name",
        "_type": "Type_Name",
        "_id": "4OhplmIBgpUzwNjPutjY",
        "_score": 1,
        "_source": {
          "currentTime": 1518491400000,
          "location": {
            "lat": 25.16,
            "lon": 55.22
          },
          "radius": 6.02,
          "myArray": [
            {
              "id": "1158",
              "field2": 14,
              "field3": 32.455,
              "myInnerArray": [
                [55.227, 25.169],
                [55.2277, 25.169],
                ...
              ]
            }
          ],
          "field4": 0.5686,
          "field5": 11.681,
          "time": "2018-02-13T03:10:00+0000"
        }
      },
      ...
    ]
  }
}
I managed to query Elasticsearch with the following code:
val df = spark.read.format("org.elasticsearch.spark.sql")
  // Some options
  .option("es.read.field.exclude", "myArray")
  .option("es.query", DSL_QUERY)
  .load("Index_Name/Type_Name")
This returns a DataFrame containing all of the data except my array. Now I want a DataFrame that includes the array as well, so I tried this:
val df = spark.read.format("org.elasticsearch.spark.sql")
  // Some options
  .option("es.read.field.as.array.include", "myArray")
  .option("es.query", DSL_QUERY)
  .load("Index_Name/Type_Name")
but I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 1389, 10.139.64.5, executor 0): java.lang.ClassCastException: scala.collection.convert.Wrappers$JListWrapper cannot be cast to java.lang.Float
    at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:109)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getFloat(rows.scala:43)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getFloat(rows.scala:194)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:423)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
    at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
    at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
What am I missing?
EDIT:
The problem seems to come from the nested array. If I add the option
.option("es.read.field.as.array.include", "myArray")
the field myArray is recognized as an array, but myInnerArray is not. So I added
.option("es.read.field.as.array.include", "myArray.myInnerArray")
and this time myInnerArray is recognized as an array, but myArray no longer is.
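A quick way to see which fields the connector actually inferred as arrays is to print the schema before triggering any action — a minimal check against the DataFrame defined above:

df.printSchema()
// With only one include value in effect, expect either myArray as an
// array<struct<...>> or myInnerArray as an array, but not both; if the
// inner field shows up as a plain float, that matches the
// unboxToFloat ClassCastException above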
Answer 0 (score: 0)
The second option seems to override the first one, because you set them as two separate .option calls with the same key.
Try combining them into a single option, like this:
.option("es.read.field.as.array.include","myArray,myArray.myInnerArray")