I am trying to index into Elasticsearch from Spark. It throws the following exception...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:110)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:58)
    at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:372)
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
    at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:67)
    at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
    at org.elasticsearch.spark.rdd.api.java.JavaEsSpark$.saveToEs(JavaEsSpark.scala:54)
    at org.elasticsearch.spark.rdd.api.java.JavaEsSpark.saveToEs(JavaEsSpark.scala)
    at com.tgt.search.metrics.es.bulk.Sparkimporter.main(Sparkimporter.java:88)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:110)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:58)
    at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:372)
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Here is my code...
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;

// Spark configuration plus the elasticsearch-spark connector settings
SparkConf conf = new SparkConf().setMaster("local")
        .setAppName("Indexer").set("spark.driver.maxResultSize", "2g");
conf.set("es.index.auto.create", "true");
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
conf.set("es.write.operation", "index");
JavaSparkContext sc = new JavaSparkContext(conf);
// Two sample documents written to the "spark/docs" index/type
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = sc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");
I tried writing the same data to a local file and that works fine... so this is probably a problem with the configuration.
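For what it's worth, that local-write check amounts to roughly the sketch below; the map-to-string step and the output path /tmp/docs-check are my own illustration, not part of the original job. It only demonstrates that the RDD itself serializes and writes to disk without error.

// Sketch of the local-write sanity check (requires org.apache.spark.api.java.function.Function).
JavaRDD<String> asText = javaRDD.map(new Function<Map<String, ?>, String>() {
    @Override
    public String call(Map<String, ?> doc) {
        return doc.toString();
    }
});
// "/tmp/docs-check" is just an example output directory.
asText.saveAsTextFile("/tmp/docs-check");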
These are the dependencies in my pom.xml:
<dependencies>
  <dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>2.1.0</version>
  </dependency>
  <!--
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.6.4</version>
  </dependency>
  -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.5.1</version>
  </dependency>
  <dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark_2.10</artifactId>
    <version>2.1.0</version>
  </dependency>
</dependencies>
Answer 0 (score: 1)
The relevant line in the stack trace is:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1967) at ...
This error does not come from your code. It is caused by an incompatibility between the Elasticsearch version and the elasticsearch-hadoop adapter you are using. Versions 2.0.x and 2.1.x of the elasticsearch-hadoop adapter only work with Elasticsearch 1.x. I ran into the same error with Elasticsearch 2.1.1 and had to downgrade Elasticsearch to 1.4.4; the error went away.
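In terms of the pom.xml above, that fix corresponds roughly to the snippet below. This is only a sketch following the answer's suggestion: 1.4.4 is simply the version the answer mentions, the 2.1.x adapter is kept as-is, and the Elasticsearch server you write to must also be a 1.x instance, since the incompatibility is between client and server, not in the Spark code itself.

<!-- Sketch: pair the 2.1.x elasticsearch-spark adapter with an Elasticsearch 1.x client -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>1.4.4</version>
</dependency>
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark_2.10</artifactId>
  <version>2.1.0</version>
</dependency>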
See the answer by costin here.