Spark: reading a nested array from Elasticsearch

Time: 2018-04-17 07:28:04

Tags: scala apache-spark elasticsearch

I have data in Elasticsearch that I want to use with Spark. The problem is that my Elasticsearch documents contain array types.

Here is a sample of my Elasticsearch data:

{  
   "took":4,
   "timed_out":false,
   "_shards":{  
      "total":36,
      "successful":36,
      "skipped":0,
      "failed":0
   },
   "hits":{  
      "total":2586638,
      "max_score":1,
      "hits":[  
         {  
            "_index":"Index_Name",
            "_type":"Type_Name",
            "_id":"l-hplmIBgpUzwNjPutjY",
            "_score":1,
            "_source":{  
               "currentTime":1518339120000,
               "location":{  
                  "lat":25.13,
                  "lon":55.18
               },
               "radius":65.935,
               "myArray":[  
                  {  
                     "id":"1154",
                     "field2":8,
                     "field3":16.39,
                     "myInnerArray":[  
                        [  
                           55.18,
                           25.13
                        ],
                        [  
                           55.18,
                           25.13
                        ],
                        ...
                     ]
                  }
               ],
               "field4":0.512,
               "field5":123.47,
               "time":"2018-02-11T08:52:00+0000"
            }
         },
         {  
            "_index":"Index_Name",
            "_type":"Type_Name",
            "_id":"4OhplmIBgpUzwNjPutjY",
            "_score":1,
            "_source":{  
               "currentTime":1518491400000,
               "location":{  
                  "lat":25.16,
                  "lon":55.22
               },
               "radius":6.02,
               "myArray":[  
                  {  
                     "id":"1158",
                     "field2":14,
                     "field3":32.455,
                     "myInnerArray":[  
                        [  
                           55.227,
                           25.169
                        ],
                        [  
                           55.2277,
                           25.169
                        ],
                       ...
                     ]
                  }
               ],
               "field4":0.5686,
               "field5":11.681,
               "time":"2018-02-13T03:10:00+0000"
            }
         },
         ...
      ]
   }
}
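
For reference, the nested structure Spark would have to handle corresponds roughly to the following schema for `_source`. This is only a sketch: the numeric types are guessed from the sample values above, not taken from the actual index mapping.

    import org.apache.spark.sql.types._

    // Rough schema of one _source document from the sample above.
    // myInnerArray is an array of [lon, lat]-style number pairs,
    // i.e. an array of arrays of doubles.
    val sourceSchema = StructType(Seq(
      StructField("currentTime", LongType),
      StructField("location", StructType(Seq(
        StructField("lat", DoubleType),
        StructField("lon", DoubleType)
      ))),
      StructField("radius", DoubleType),
      StructField("myArray", ArrayType(StructType(Seq(
        StructField("id", StringType),
        StructField("field2", LongType),
        StructField("field3", DoubleType),
        StructField("myInnerArray", ArrayType(ArrayType(DoubleType)))
      )))),
      StructField("field4", DoubleType),
      StructField("field5", DoubleType),
      StructField("time", StringType)
    ))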

I managed to query Elasticsearch with the following code:

val df= spark.read.format("org.elasticsearch.spark.sql")
             // Some options
             .option("es.read.field.exclude","myArray")
             .option("es.query", DSL_QUERY)
             .load("Index_Name/Type_Name")

This returns a DataFrame containing all the data except my array. I would now like a DataFrame that includes the array as well. I tried this:

val df= spark.read.format("org.elasticsearch.spark.sql")
        // Some options
        .option("es.read.field.as.array.include","myArray")
        .option("es.query", DSL_QUERY)
        .load("Index_Name/Type_Name")

But I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 1389, 10.139.64.5, executor 0): java.lang.ClassCastException: scala.collection.convert.Wrappers$JListWrapper cannot be cast to java.lang.Float
    at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:109)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getFloat(rows.scala:43)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getFloat(rows.scala:194)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:423)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
    at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
    at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

What am I missing?

Edit:

The problem seems to come from the nested array. If I add the option

.option("es.read.field.as.array.include","myArray")

the field myArray is recognized as an array, but 'myInnerArray' is not. So I added

.option("es.read.field.as.array.include","myArray.myInnerArray")

This time, 'myInnerArray' is recognized as an array, but 'myArray' is not.

1 Answer:

Answer 0 (score: 0)

The second option seems to override the first one, because you have split them into two separate lines.

Try combining them into a single line, like this:
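
That is, if the two paths were passed through separate `.option` calls with the same key, as sketched below, `DataFrameReader` only keeps the last value set for that key (this is an assumption about the reading code from the question):

    // Presumed reading code: the same key is set twice, so the second
    // value replaces the first and only "myArray.myInnerArray" is
    // treated as an array.
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.read.field.as.array.include", "myArray")
      .option("es.read.field.as.array.include", "myArray.myInnerArray")
      .option("es.query", DSL_QUERY)
      .load("Index_Name/Type_Name")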

.option("es.read.field.as.array.include","myArray,myArray.myInnerArray")