PySpark es.query only works when left at the default

Time: 2017-09-13 20:34:43

Tags: hadoop apache-spark elasticsearch pyspark

In PySpark, the only way I can get data back from ES is by leaving es.query at its default value. Why is that?

import json

es_query = {"match" : {"key" : "value"}}
es_conf = {"es.nodes" : "localhost", "es.resource" : "index/type", "es.query" : json.dumps(es_query)}
# Read from Elasticsearch through the es-hadoop input format
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
...
rdd.count()
0
rdd.first()
ValueError: RDD is empty

However, this query (the default) seems to work:

es_query = {"match_all" : {}}
...
rdd.first()
(u'2017-01-01 23:59:59)

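*I have tested the queries by running them against Elasticsearch directly, and they work there, so the problem appears to be on the Spark / es-hadoop side.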

1 answer:

Answer 0 (score: 0)

By default, the API adds "query":{} in front of your actual query. To Elasticsearch, the query you are sending will look like

"query" :{
"match" : {"key" : "value"}
}

which is not valid.
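One thing that may be worth trying is to pass es.query as a complete JSON body instead of a bare clause. The sketch below assumes es-hadoop accepts the query DSL form of es.query as a full object with a top-level "query" key, which is how I recall its configuration documentation showing it; whether this fixes the empty RDD in this setup is not confirmed in this thread. The helper name build_es_conf is hypothetical and only for illustration.

```python
import json


def build_es_conf(nodes, resource, inner_query):
    """Build an es-hadoop conf dict (hypothetical helper, not part of any library)."""
    # Assumption: es.query should be a complete JSON object with a
    # top-level "query" key, not just the bare match clause.
    body = {"query": inner_query}
    return {
        "es.nodes": nodes,
        "es.resource": resource,
        "es.query": json.dumps(body),
    }


es_conf = build_es_conf("localhost", "index/type", {"match": {"key": "value"}})
print(es_conf["es.query"])

# With a live SparkContext `sc`, the conf is passed exactly as in the question:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_conf)
```

If rdd.count() is still 0 with this form, sending the same JSON body straight to the index's _search endpoint would show whether the query itself matches any documents.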