PySpark streaming write to Elasticsearch

Asked: 2017-11-10 16:00:35

Tags: elasticsearch pyspark spark-streaming

Is there a way to write to Elasticsearch from Spark Streaming, reading from Kafka? I tried something like the following, as explained in the Elasticsearch documentation (there is very little of it for PySpark):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", appName="TwitterStreamKafka")
ssc = StreamingContext(sc, batchIntervalSeconds)
topic = url_topic

tweets = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
tweets.pprint()

conf = {"es.resource": "credentials/credential"}  # assume Elasticsearch is running on localhost defaults
if tweets.count() > 0:
    tweets.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf))

ssc.start()
ssc.awaitTermination()

But it does not work. The error is:

17/11/10 17:16:35 ERROR Utils: Aborting task
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..
    at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:251)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:203)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:222)
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:244)
    at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:269)
    at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
    at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.close(EsOutputFormat.java:196)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:155)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:144)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:159)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/11/10 17:16:35 ERROR SparkHadoopMapReduceWriter: Task attempt_20171110171633_0003_r_000000_0 aborted.
17/11/10 17:16:35 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:178)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
    at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This is the command I use to run it:

spark-submit --jars elasticsearch-hadoop-5.6.4.jar,spark-streaming-kafka-0-10-assembly_2.11-2.2.0.jar es_spark_write.py 
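
As an aside, the same dependencies could presumably also be resolved by Maven coordinates instead of local jars; assuming the artifacts matching the jar names above are published on Maven Central, a roughly equivalent invocation would be:

spark-submit --packages org.elasticsearch:elasticsearch-hadoop:5.6.4,org.apache.spark:spark-streaming-kafka-0-10-assembly_2.11:2.2.0 es_spark_write.py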

I am using Spark 2.2.0. The messages coming from Kafka are keyed JSON messages, like this:

(u'urls', u'{"token": "secret_token", "count": 2443}')
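
One thing worth noting (an editorial aside, not part of the original post): the write above hands EsOutputFormat the raw (key, unicode JSON string) tuples. Below is a minimal sketch of one possible adjustment, assuming that value shape is what triggers the 400 "failed to parse": parse each value into a dict so it is serialized as a document (the es-hadoop configuration also documents an es.input.json setting for values that are already JSON). The to_indexable helper and es_write_conf names are illustrative, and the snippet reuses the tweets stream defined in the code above.

import json

# Hypothetical helper: turn each Kafka record, which arrives as a
# (key, json_string) tuple, into a (key, dict) pair so es-hadoop can
# serialize the value as a document instead of a plain string.
def to_indexable(record):
    key, value = record
    return (key, json.loads(value))

es_write_conf = {
    "es.resource": "credentials/credential",  # index/type, as in the question
    # Alternative: keep the value as a JSON string and set
    # "es.input.json": "yes" so the connector indexes it as-is.
}

tweets.foreachRDD(lambda rdd: rdd.map(to_indexable).saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf))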

0 Answers

There are no answers yet.