Elasticsearch query using pyspark and the elasticsearch-hadoop connector fails in RecordReader.close()

Asked: 2016-04-26 21:10:11

Tags: hadoop elasticsearch pyspark

Reading from Elasticsearch into an RDD throws this exception:

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: ActionRequestValidationException[Validation Failed: 1: no scroll ids specified;]

mr.EsInputFormat: Cannot determine task id...

Software versions: pyspark 1.6, with elasticsearch-hadoop-2.2.1 as the Elasticsearch connector, Elasticsearch 1.0.1, Hadoop 2.7.2, and Python 2.7.

The elasticsearch-hadoop-2.2.1 library comes from here: https://www.elastic.co/guide/en/elasticsearch/hadoop/2.2/reference.html
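
For reference, the connector jar has to be visible to Spark; a minimal sketch of one common way to wire it up when building the SparkContext is below (the jar path is a hypothetical placeholder, and the actual setup here may differ):

# Minimal sketch (assumed setup, not shown in the question): put the
# elasticsearch-hadoop jar on the driver and executor classpaths.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("es-read")
        .set("spark.driver.extraClassPath", "/path/to/elasticsearch-hadoop-2.2.1.jar")
        .set("spark.executor.extraClassPath", "/path/to/elasticsearch-hadoop-2.2.1.jar"))
sc = SparkContext(conf=conf)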

Code:

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.resource": "INDEX/TYPE", "es.nodes": "NODE_NAME"})
print(es_rdd.first())
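
For completeness, the same read can be written with the port and query made explicit; es.port and es.query are standard elasticsearch-hadoop settings, and the values below are only illustrative:

# Same read with es.port and es.query spelled out (illustrative values).
# INDEX/TYPE and NODE_NAME are placeholders, as in the snippet above.
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "INDEX/TYPE",
        "es.nodes": "NODE_NAME",
        "es.port": "9200",                           # default Elasticsearch HTTP port
        "es.query": '{"query": {"match_all": {}}}',  # explicit match-all query
    })
print(es_rdd.first())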

Please help resolve this exception.

The following warnings are printed before the exception and may be related to it, and possibly to the actual failure: mr.EsInputFormat: Cannot determine task id...

INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

The full exception (the failure occurs when the connector tries to clean up its scroll as the reader closes; see the deleteScroll frame below):

16/04/26 21:00:02 INFO rdd.NewHadoopRDD: Input split: ShardInputSplit [node=[KHHV8pgMQySzw9Fz1Xt7VQ/Iguana|135.17.42.49:9200], shard=0]
16/04/26 21:00:02 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/04/26 21:00:02 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/26 21:00:02 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

16/04/26 19:31:12 WARN mr.EsInputFormat: Cannot determine task id...
16/04/26 19:31:14 WARN rdd.NewHadoopRDD: Exception in RecordReader.close()
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: ActionRequestValidationException[Validation Failed: 1: no scroll ids specified;]
        at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:478)
        at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:449)
        at org.elasticsearch.hadoop.rest.RestClient.deleteScroll(RestClient.java:512)
        at org.elasticsearch.hadoop.rest.ScrollQuery.close(ScrollQuery.java:70)
        at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.close(EsInputFormat.java:262)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.org$apache$spark$rdd$NewHadoopRDD$$anon$$close(NewHadoopRDD.scala:191)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:166)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:118)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

Thanks!

0 Answers:

No answers yet.