We have a Spark Streaming application (code below) that fetches data from Kafka and applies some transformations (to each message) before inserting the data into MongoDB. A middleware application pushes messages (in bulk) to Kafka and waits for an acknowledgement from the Spark Streaming application for every message. If the middleware does not receive an acknowledgement within a fixed window (5 seconds) after sending a message to Kafka, it resends that message. The Spark Streaming application can receive roughly 50-100 messages (one batch) and acknowledge all of them within 5 seconds. However, if the middleware pushes more than 100 messages, the delay in sending acknowledgements causes the middleware to resend messages. In our current implementation we create a producer every time we want to send an acknowledgement, and that alone takes 3-4 seconds.
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Kafka producer configuration, reused for every producer instance created below
    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    // Read the target schema from MongoDB once, up front
    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      {
        val jsonInput: JValue = parse(x)

        /*Do all the transformations using Json libraries*/
        val json4s_transformed = "transformed json"

        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)
        df.write.option("spark.mongodb.output.uri", "connectionURI")
          .option("collection", "Collection")
          .mode("append").format("com.mongodb.spark.sql").save()

        // A new producer is created, used for a single acknowledgement, and closed per message
        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
        producer.send(message)
        producer.close()
      }
    }
    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
So we tried another approach: create the producer outside foreachRDD and reuse it for the whole batch interval (code below). This seems to help, since we no longer create a producer every time we want to send an acknowledgement. But for some reason, when we monitor the application in the Spark UI, the streaming application's memory consumption increases steadily, which was not the case before. We also tried the --num-executors 1 option in spark-submit to limit the number of executors launched by YARN.
object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema

    // The producer is now created once, outside foreachRDD
    val producer = new KafkaProducer[String, String](props)

    KafkaDstream.foreachRDD(rdd =>
      {
        rdd.collect().map ( x =>
          {
            val jsonInput: JValue = parse(x)

            /*Do all the transformations using Json libraries*/
            val json4s_transformed = "transformed json"

            val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
            val df = sqlContext.read.schema(outSchema).json(rdd)
            df.write.option("spark.mongodb.output.uri", "connectionURI")
              .option("collection", "Collection")
              .mode("append").format("com.mongodb.spark.sql").save()

            val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
            producer.send(message)
            producer.close()
          }
        )
      }
    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
My questions are:
How do we monitor the Spark application's memory consumption? At the moment we check the application manually every 5 minutes until it exhausts the memory available in our cluster (16 GB per node).
What are the best practices followed in the industry when using Spark Streaming with Kafka?
Answer 0 (score: 3)
Kafka is a broker: it gives you delivery guarantees for producers and consumers. Implementing an 'over the top' acknowledgement mechanism between the producer and the consumer may well be overkill. Ensure that the producer behaves correctly and that the consumer can recover from failures, and end-to-end delivery is already taken care of.
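For instance, a minimal sketch of what "making the producer behave correctly" could look like, leaning on Kafka's own acknowledgement and retry machinery instead of a custom round-trip (the config values, the reliableProps/reliableProducer names and the error handling are illustrative assumptions, not part of the original code):
import java.util.HashMap
import org.apache.kafka.clients.producer.{ Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata }

val reliableProps = new HashMap[String, Object]()
reliableProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.100:9092")
reliableProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
reliableProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
reliableProps.put(ProducerConfig.ACKS_CONFIG, "all")  // broker confirms only after all in-sync replicas have the record
reliableProps.put(ProducerConfig.RETRIES_CONFIG, "3") // let the client retry transient failures itself

val reliableProducer = new KafkaProducer[String, String](reliableProps)

// Check the per-record outcome instead of assuming success
reliableProducer.send(
  new ProducerRecord[String, String]("topic_name", null, "message_received"),
  new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      if (exception != null) {
        // delivery failed even after retries: log it and apply your own recovery policy
        exception.printStackTrace()
      }
    }
  })
With acks set to all plus client-side retries, a send whose callback completes without an exception already means the broker has durably accepted the record, which is essentially the guarantee the middleware is re-implementing by hand.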
As for the job itself, it is no wonder it performs poorly: processing happens sequentially, element by element, all the way to the write into the external database. This is a plain mistake and should be fixed before chasing any memory consumption issue.
The process could be improved as follows:
val producer = // create producer
val jsonDStream = kafkaDstream.transform{ rdd => rdd.map{ elem =>
    val json = parse(elem)
    render(doAllTransformations(json)) // output should be a String-formatted JSON object
  }
}

jsonDStream.foreachRDD{ rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = //create message
  producer.send(msg)
  producer.flush() // force send. *DO NOT Close* otherwise it will not be able to send any more messages
}
This process could be further improved if we could replace all the string-centric JSON transformations with case class instances.
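A sketch of that last suggestion, reusing kafkaDstream and sqlContext from the code above (the Event case class and its fields are hypothetical, chosen only to illustrate the pattern): json4s binds each message to a typed instance, and Spark derives the DataFrame schema from the case class, so the string-centric parse/render round-trip disappears.
// Hypothetical schema; define it at the top level so Spark can derive a schema from it
case class Event(id: String, value: Double, timestamp: String)

val typedDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    implicit val formats: Formats = DefaultFormats
    parse(elem).extract[Event] // parse and bind to a typed instance in one step
  }
}

typedDStream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF() // schema comes from the case class, no re-parsing of JSON strings
  df.write.option("spark.mongodb.output.uri", "connectionURI")
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
}
Beyond removing the string round-trip, extraction failures then surface per record as exceptions you can handle, and the transformations themselves become compile-checked.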