I'm running into a performance problem with Spark Streaming (Spark 2.x).
My use case:
Spark is configured to run on YARN (6 workers: 32 vcores and 256 GB of RAM per node, of which 180 GB are dedicated to YARN). Kudu is installed on the same nodes, with a 40 GB hard memory limit per tablet server. My Spark Streaming job is submitted from an edge node in yarn-client mode, with a 5-second batch interval and 12 executors (10 GB of memory and 1 core each).
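In Spark terms, that corresponds roughly to the following settings (just a sketch to make the numbers concrete; property names come from the Spark documentation, and resourceConf / duration below are only illustrative, not my actual code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds

// Sketch only: the resources described above, expressed as Spark properties.
val resourceConf = new SparkConf()
  .set("spark.executor.instances", "12") // 12 executors
  .set("spark.executor.memory", "10g")   // 10 GB of memory per executor
  .set("spark.executor.cores", "1")      // 1 core per executor

// The 5-second period is the batch interval passed to the StreamingContext.
val duration = Seconds(5)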
When fewer than 300,000 devices are connected, each sending one message every 5 seconds, everything works fine. With more than 300,000 connected devices sending one message every 5 seconds, Spark starts doing strange things that I don't understand (after one or two hours). By strange I mean:
- an average scheduling delay of less than 1 second while everything is fine;
- sometimes, when things get heavier, some scheduling delays climb to 1 minute and then come back to more usual values;
- after 1-2 hours the scheduling delay goes crazy and keeps growing, until the Kafka retention period is exceeded and errors are raised because the data is no longer available.
I assumed Spark would ask for more resources, but the YARN console says otherwise: I'm only using 50% of the available memory and less than 20% of the total vcores.
So... what is going on? I can't see anything in any of the logs, nothing at all. My processing just seems to get stuck for no reason (I'm sure there is a perfectly logical one... I just can't find it). And of course, from that moment on, nothing gets written to Kudu anymore.
One more thing: when my devices stop sending messages, my Spark application keeps all 51 containers it requested during ingestion and never releases them. Is that normal? Can I tell Spark to release them? How?
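From what I have read, dynamic allocation is the mechanism that is supposed to release idle executors, but I don't know how well it plays with a streaming job. A minimal sketch of the relevant properties, assuming it applies here at all (names come from the Spark documentation; the values are examples only, not what I currently run):

import org.apache.spark.SparkConf

// Sketch only: settings that should let YARN reclaim idle executors.
val dynConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")            // release executors when they sit idle
  .set("spark.shuffle.service.enabled", "true")              // external shuffle service, required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")          // example lower bound
  .set("spark.dynamicAllocation.maxExecutors", "12")         // example cap, matching the 12 executors above
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // example idle timeout

Is something like this the right way to get the containers released, or is there a streaming-specific mechanism I'm missing?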
From what I can read in the logs, this is where the time is being lost:
18/07/12 07:18:26 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-SessionSubscriber heartbeat 5 1272330574
18/07/12 07:19:07 INFO executor.Executor: Finished task 5.0 in stage 149.0 (TID 4493). 3831 bytes result sent to driver
When everything is fine, the same step takes much less time:
18/07/12 07:18:11 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-SessionSubscriber heartbeat 1 1270772473
18/07/12 07:18:12 INFO executor.Executor: Finished task 18.0 in stage 141.0 (TID 4245). 3831 bytes result sent to driver
18/07/12 07:19:07 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-SessionSubscriber heartbeat 5 1272344278
18/07/12 07:19:08 INFO executor.Executor: Finished task 5.0 in stage 153.0 (TID 4595). 3831 bytes result sent to driver
What about Kudu? Kudu may well have some issues, but as far as I understand how Kudu works, data is first flushed into the tablet server's memory before being written to disk, and that operation does not block the Spark Streaming job. I would also expect to get a warning from Kudu telling me something went wrong...
And what about my Spark code? Here it is...
// Create Spark Conf
val conf: SparkConf = new SparkConf()
conf.set("spark.streaming.concurrentJobs", "2") // allow 2 streaming jobs to run concurrently
conf.setAppName("Audience")
conf.setMaster(master)
conf.setSparkHome(sparkHome)

// Create Spark Session
// **********
val spark: SparkSession = SparkSession.builder()
  .config(conf)
  .getOrCreate()

// Create Spark Streaming Context
// **********
val ssc: StreamingContext = StreamingContext.getActiveOrCreate(
  () => new StreamingContext(spark.sparkContext, duration))
def main(args: Array[String]) {
  // Start the job
  // **********
  subscribe
  startEtl
}
def subscribe {
  // Create Kudu Context
  // **********
  val kudu: KuduContext = new KuduContext(kuduMasters, spark.sparkContext)

  // Subscribe to Kafka
  // **********
  // Topics to be subscribed
  val topicSetSession: Array[String] = topicSession.split(",")

  // Kafka subscriber configuration
  val sessionKafkaParams = Map[String, Object](
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "SessionSubscriber",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean))

  // Get Session Stream
  val rawSessionStream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topicSetSession, sessionKafkaParams))

  // Pull the Kafka topic and process the stream
  val processedSessionStream = ingestSession(rawSessionStream)

  // Persist the session
  persistSession(kudu, processedSessionStream)
}
/**
 * Start the Spark ETL
 */
def startEtl {
  // Start the Spark Streaming batch
  ssc.start()
  ssc.awaitTermination()
}
/**
 * Close the spark session
 */
def close {
  ssc.stop(true)
  spark.close
}
/**
 * Process the raw stream polled from Kafka and convert it into a DStream
 *
 * @param rawSessionStream Raw stream polled from Kafka
 */
def ingestSession(rawSessionStream: InputDStream[ConsumerRecord[String, String]]): DStream[KuduSession] = {
  val parsedSessionStream = rawSessionStream.map(record => KuduSession(record.value.toString.split('|')))
  parsedSessionStream
}
/**
 * Persist each record from the processed stream into the persistence layer
 *
 * @param kuduContext            Kudu context to be used to persist into Kudu
 * @param processedSessionStream Processed stream of data to persist
 */
def persistSession(kuduContext: KuduContext, processedSessionStream: DStream[KuduSession]): Unit = {
  import spark.implicits._

  // Column names of the target Kudu table
  val newNames = Seq("session_id", "device_id", "channel_id", "real_channel_id",
    "start_session_year", "start_session_month", "start_session_day",
    "start_session_hour", "start_session_minute", "start_session_second",
    "end_session_year", "end_session_month", "end_session_day",
    "end_session_hour", "end_session_minute", "end_session_second")

  // Upsert each micro-batch into the Kudu table
  processedSessionStream.foreachRDD(rdd => {
    kuduContext.upsertRows(rdd.toDF(newNames: _*), "impala::devices.hbbtv_session")
  })
}
Thanks for your help!