I am trying to query an Elasticsearch index with pyspark, without success:
] ./bin/pyspark --driver-class-path=jars/elasticsearch-hadoop-2.2.0.jar
In ipython, with Spark version 2.0.1:
In [1]: es_read_conf = {
    "es.resource": "test/docs",
    "es.nodes": ["xx.xx.xx.aa", "xx.xx.xx.bb", "xx.xx.xx.cc"],
    "es.port": "9200",
    "es.net.http.auth.user": "myusername",
    "es.net.http.auth.pass": "mypassword",
}

In [2]: es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf,
)
I get the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
There seems to be a problem converting the Python list in es.nodes to a Java string. I tried using a string containing only the address of my Elasticsearch master node ("xx.xx.xx.aa"), but then I get a different error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [test/docs] failed; server[xx.xx.xx.bb:9202] returned [502|Bad Gateway:]
Sometimes the error refers to data node bb, sometimes to cc. Interestingly, if I run the same command several times, it occasionally completes without any error (perhaps only when the query happens to run against the master node alone?). Running the command with localhost as the only entry in es.nodes works without problems.
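For what it's worth, the ClassCastException in the first attempt is consistent with es.nodes being passed as a Python list: the Hadoop configuration values must all be strings. A minimal sketch of the configuration with the nodes joined into a comma-separated string (untested against a live cluster; the addresses and credentials are the placeholders from the question):

```python
# elasticsearch-hadoop expects es.nodes as a single comma-separated string,
# not a Python list; passing a list is what triggers the
# "java.util.ArrayList cannot be cast to java.lang.String" error above.
nodes = ["xx.xx.xx.aa", "xx.xx.xx.bb", "xx.xx.xx.cc"]

es_read_conf = {
    "es.resource": "test/docs",
    "es.nodes": ",".join(nodes),  # -> "xx.xx.xx.aa,xx.xx.xx.bb,xx.xx.xx.cc"
    "es.port": "9200",
    "es.net.http.auth.user": "myusername",
    "es.net.http.auth.pass": "mypassword",
}

# With a SparkContext `sc` available, the RDD would then be created exactly
# as in the question:
# es_rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_read_conf,
# )
```

Note also that the second error reports port 9202 on server bb while the configuration sets es.port to 9200, which may point to a proxy or node-discovery issue rather than the configuration format itself.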
Answer 0 (score: 0)