I'm trying to convert a Kafka stream into RDDs and insert those RDDs into an Elasticsearch database.
The saveAsNewAPIHadoopFile call should write these RDDs to ES, but I'm getting an error. Here is my code:
import sys

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf().setAppName("ola")
sc = SparkContext(conf=conf)

# Elasticsearch connection settings for elasticsearch-hadoop
es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "pipe/word"
}

ssc = StreamingContext(sc, 2)
brokers, topic = sys.argv[1:]

# Read the Kafka topic as a direct stream and count words per micro-batch
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])
value_counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

value_counts.transform(lambda rdd: rdd.map(f))

value_counts.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)

ssc.start()
ssc.awaitTermination()
The transform function is supposed to convert the stream into a Spark DataFrame. How can I write these RDDs to Elasticsearch? Thanks!
Answer 0 (score: 0)
You can use foreachRDD:
value_counts.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(...))
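A DStream itself has no saveAsNewAPIHadoopFile method, so the write has to happen inside foreachRDD, once per micro-batch. A minimal sketch, reusing the es_write_conf from the question; the to_es_record helper is a hypothetical formatter, not something from the original code:

def to_es_record(pair):
    # Hypothetical formatter: EsOutputFormat expects (key, dict-like document) pairs
    word, count = pair
    return ('key', {'word': word, 'count': count})

def write_to_es(rdd):
    # Reshape each record and write the whole batch to Elasticsearch
    rdd.map(to_es_record).saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_conf)

value_counts.foreachRDD(write_to_es)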
Answer 1 (score: 0)
new = rawUser.rdd.map(lambda item: ('key', {'id': item['entityId'],'targetEntityId': item['targetEntityId']}))
Here rawUser is a DataFrame and new is a PipelinedRDD.
new.saveAsNewAPIHadoopFile(
path='/home/aakash/test111/',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={ "es.resource" : "index/test" ,"es.mapping.id":"id","es.nodes" : "localhost","es.port" : "9200","es.nodes.wan.only":"false"})
The most important thing here is to download a compatible elasticsearch-hadoop JAR from https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-hadoop. Check your Elasticsearch version and download the matching jar.
Command to make pyspark use the jar:
pyspark --jars elasticsearch-hadoop-6.2.4.jar
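The same flag works with spark-submit. For the streaming job in the question you would also need the Kafka connector on the classpath; a hedged example, assuming Spark 2.x with the spark-streaming-kafka-0-8 connector, where the script name, broker address and topic are placeholders:

spark-submit \
    --jars elasticsearch-hadoop-6.2.4.jar \
    --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 \
    kafka_to_es.py localhost:9092 mytopic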