Querying an Elasticsearch index with pyspark: how do I specify es.nodes?

Asked: 2017-01-23 15:33:17

Tags: python elasticsearch pyspark

I am trying to query an Elasticsearch index with pyspark, without success:

] ./bin/pyspark --driver-class-path=jars/elasticsearch-hadoop-2.2.0.jar

In IPython, with Spark version 2.0.1:

In [1]: es_read_conf = { "es.resource" : "test/docs" , "es.nodes" : ["xx.xx.xx.aa","xx.xx.xx.bb","xx.xx.xx.cc"],"es.port" : "9200", "es.net.http.auth.user": "myusername", "es.net.http.auth.pass": "mypassword"}
es_rdd = sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=es_read_conf)

I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String

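It looks like the Python list given for es.nodes cannot be converted to a Java String. I tried passing a string containing only the address of my Elasticsearch master node ("xx.xx.xx.aa") instead, but then I get a different error: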

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [test/docs] failed; server[xx.xx.xx.bb:9202] returned [502|Bad Gateway:]

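Sometimes the error refers to data node bb, sometimes to cc. Interestingly, if I run the same command several times, it occasionally completes with no error at all (perhaps only when the query happens to run against the master node alone?). Running the command with localhost as the only es.nodes value works without problems.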

1 answer:

Answer 0: (score: 0)

See the ES-Hadoop documentation.

You need to set the following property on the Spark conf:

conf.set("es.nodes","<your host>")
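For the dictionary passed to newAPIHadoopRDD in the question, every value has to be a plain string, which is why the Python list triggers the ClassCastException. A minimal sketch of one way to build such a dictionary follows, assuming es.nodes accepts several hosts as a single comma-separated string. The helper name build_es_read_conf and its parameters are hypothetical and not part of pyspark or ES-Hadoop. The optional es.nodes.discovery setting is included only as something to experiment with for the intermittent 502 responses from the data nodes; whether it resolves them in this cluster is not confirmed here.

```python
def build_es_read_conf(resource, nodes, port="9200",
                       user=None, password=None, discovery=None):
    """Build a Hadoop-style conf dict whose values are all plain strings.

    Hypothetical helper for illustration; only the "es.*" keys come from
    ES-Hadoop, the function itself is not a library API.
    """
    conf = {
        "es.resource": resource,
        # Join the host list into one string instead of passing a list.
        "es.nodes": ",".join(nodes),
        "es.port": str(port),
    }
    if user is not None and password is not None:
        conf["es.net.http.auth.user"] = user
        conf["es.net.http.auth.pass"] = password
    if discovery is not None:
        # Assumption: turning discovery off keeps the connector on the
        # listed nodes only; verify against your ES-Hadoop version.
        conf["es.nodes.discovery"] = "true" if discovery else "false"
    return conf


es_read_conf = build_es_read_conf(
    "test/docs",
    ["xx.xx.xx.aa", "xx.xx.xx.bb", "xx.xx.xx.cc"],
    user="myusername",
    password="mypassword",
)

# Inside the pyspark shell, where `sc` already exists, the dict can be
# passed exactly as in the question:
# es_rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_read_conf)
```

The Spark call is left commented out so that the snippet runs without a cluster; only the construction of the string-valued dictionary is exercised.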