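I need to consume messages from a Kafka producer, find the words containing % in those messages, and generate a message for each different % value. Finally, I need to send the result to ElasticSearch.
I can see the values in the console with kafkaStream.print(), but I need to process the strings so that I can match the required keyword and generate the messages.
My code: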
package rnd

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WordFind {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("KafkaReceiver")
    val checkpointDir = "/usr/local/kafka/kafka_2.11-0.11.0.2/checkpoint/"
    val batchIntervalSeconds = 2
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("wordcounttopic" -> 5))
    val s = kafkaStream.print()
    println(" the words are: " + s)
    ssc.remember(Minutes(1))
    ssc.checkpoint(checkpointDir)
    ssc.start()
    ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)
  }
}
If I send "usage is 75%" through the Kafka producer, a message saying "increase ram by 25%" should end up in ElasticSearch.
The output I get is:
18/02/09 16:38:27 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37879 with 2.4 GB RAM, BlockManagerId(driver, localhost, 37879)
18/02/09 16:38:27 INFO BlockManagerMaster: Registered BlockManager
18/02/09 16:38:27 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
***the words are: ()***
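I want the string I passed in to appear in place of ().
Answer 0 (score: 0)
val kafkaStream is a ReceiverInputDStream[(String, String)] whose elements are (kafkaMetaData, kafkaMessage) tuples, that is, the Kafka message key and the message value.
For details, see [https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala#L135].
We need to extract the second element of the tuple and pattern-match on it (that is, filter the ReceiverInputDStream to find the words containing %), then use map to generate the output (the messages for the different % values). As @stefanobaghino mentioned, print() only prints the output to the console; it returns Unit rather than a string of the records, which is why the println in the question shows ().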
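To make the () in the question's output concrete, here is a minimal plain-Scala sketch with no Spark involved. UnitDemo and printBatch are hypothetical names; printBatch only stands in for print(), sharing the one property that matters here: it performs a side effect and returns Unit.

```scala
object UnitDemo {
  // Hypothetical stand-in for DStream.print(): side effect only, returns Unit.
  def printBatch(records: Seq[String]): Unit = records.foreach(println)

  // Mirrors `val s = kafkaStream.print()` from the question: s holds Unit, not the records.
  val s: Unit = printBatch(Seq("usage is 75%"))

  // Concatenating Unit to a String appends its textual form, "()".
  val line: String = " the words are: " + s
}
```

The records themselves only reach the console through the side effect, so any further processing has to happen in transformations on the stream, not on the return value of print().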
For example, the filtering and mapping could look like this:
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

// sparkStreamingContext is the application's StreamingContext (ssc in the question)
val kafkaStream: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream(sparkStreamingContext, "localhost:2181",
    "spark-streaming-consumer-group", Map("wordcounttopic" -> 5))

// Keep only records whose message (second tuple element) contains "%"
val filteredStream: DStream[(String, String)] = kafkaStream
  .filter(record => record._2.contains("%")) // TODO : pattern matching here

// Turn each remaining message into the output string
val outputDStream: DStream[String] = filteredStream
  .map(record => record._2.toUpperCase()) // just assuming some operation
outputDStream.print()
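The pattern-matching step left as a TODO could be sketched in plain Scala as follows. The object name Recommender and the rule itself (recommend the headroom up to 100%, so 75% becomes 25%) are assumptions inferred from the single example in the question, not something the original post specifies.

```scala
object Recommender {
  // An integer of up to three digits immediately followed by '%', e.g. "75%".
  private val Percent = """\b(\d{1,3})%""".r

  // Assumed rule (not from the original post): suggest raising RAM by (100 - usage)%.
  // Returns None when the message has no percentage or the value exceeds 100.
  def recommend(message: String): Option[String] =
    Percent.findFirstMatchIn(message)
      .map(_.group(1).toInt)
      .filter(_ <= 100)
      .map(used => s"increase ram by ${100 - used}%")
}
```

In the stream above, this could take the place of the filter and toUpperCase steps, for instance kafkaStream.flatMap(record => Recommender.recommend(record._2)), so that messages without a usable percentage are dropped.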
Use outputDStream to write to ElasticSearch. Hope this helps.
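Before writing, each output string usually needs to be shaped into a document. A minimal sketch, assuming a flat two-field document; AlertDoc, toDoc, and the field names are hypothetical and chosen only for illustration.

```scala
object AlertDoc {
  // Hypothetical document shape: the original Kafka message plus the generated recommendation.
  def toDoc(source: String, recommendation: String): Map[String, String] =
    Map("source" -> source, "recommendation" -> recommendation)
}
```

One option for the write itself, assuming the elasticsearch-spark connector (elasticsearch-hadoop 5.0 or later) is on the classpath: importing org.elasticsearch.spark.streaming._ adds a saveToEs method to a DStream of such maps, which takes the target index/type as a string. The index name and the es.nodes setting would depend on your cluster.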