Writing DStream data from PySpark to Elasticsearch with saveAsNewAPIHadoopFile

Date: 2016-12-29 18:30:42

Tags: elasticsearch apache-spark pyspark apache-kafka spark-streaming

I am trying to convert a Kafka stream into RDDs and insert those RDDs into an Elasticsearch database. The saveAsNewAPIHadoopFile call should write these RDDs to ES, but I get an error. Here is my code:

import sys

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf().setAppName("ola")
sc = SparkContext(conf=conf)

# Connection settings for the elasticsearch-hadoop output format
es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "pipe/word"
}

ssc = StreamingContext(sc, 2)
brokers, topic = sys.argv[1:]
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])
value_counts = lines.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)

value_counts.transform(lambda rdd: rdd.map(f))
value_counts.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)

ssc.start()
ssc.awaitTermination()

The transform function is supposed to turn the stream into a Spark DataFrame. How can I write these RDDs to Elasticsearch? Thanks!

2 Answers:

Answer 0 (score: 0):

You can use foreachRDD:

value_counts.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(...))
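
A fuller sketch of that approach, reusing the es_write_conf from the question. The to_es_doc helper and its field names are illustrative assumptions, not part of the original answer:

def to_es_doc(pair):
    # Hypothetical helper: turn each (word, count) pair into the
    # (ignored key, dict) shape that EsOutputFormat can serialize.
    word, count = pair
    return ('key', {'word': word, 'count': count})

def save_to_es(rdd):
    # saveAsNewAPIHadoopFile is an RDD method, not a DStream method,
    # so it has to be called inside foreachRDD.
    rdd.map(to_es_doc).saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_conf)

value_counts.foreachRDD(save_to_es)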

Answer 1 (score: 0):

# Turn each row into a (key, dict) pair; EsOutputFormat ignores the key
new = rawUser.rdd.map(lambda item: ('key', {'id': item['entityId'],
                                            'targetEntityId': item['targetEntityId']}))

Here rawUser is a DataFrame and new is a PipelinedRDD.
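
For context, a rawUser DataFrame with the two fields referenced above could be built like this (the SparkSession setup and sample values are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-demo").getOrCreate()

# Sample rows carrying the two fields the map above reads
rawUser = spark.createDataFrame(
    [("u1", "t1"), ("u2", "t2")],
    ["entityId", "targetEntityId"])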

new.saveAsNewAPIHadoopFile(
    path='/home/aakash/test111/',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "index/test",
        "es.mapping.id": "id",
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.nodes.wan.only": "false"
    })
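
Note that es.mapping.id tells the connector to use each document's id field as the Elasticsearch document _id, so re-running the job updates existing documents instead of creating duplicates.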

The most important thing here is to download a compatible JAR from https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-hadoop: check your Elasticsearch version and download the matching jar.

Command to make pyspark load the jar:

pyspark --jars elasticsearch-hadoop-6.2.4.jar
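
The same flag works with spark-submit when running a standalone script (the script name and arguments here are illustrative):

spark-submit --jars elasticsearch-hadoop-6.2.4.jar write_to_es.py <broker-list> <topic>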