I have a Spark Kafka streaming job. Below is the main processing logic of the job.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, current_date, explode}
import org.apache.spark.sql.types.{DateType, LongType, TimestampType}
import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import com.redislabs.provider.redis._ // spark-redis implicits (toRedisKV)

val processedStream = rawStream.transform(x => {
  val offsetRanges = x.asInstanceOf[HasOffsetRanges].offsetRanges;
  val spark = SparkSession.builder.config(x.sparkContext.getConf).getOrCreate();
  val parsedRDD = x.map(cr => cr.value());
  var df = spark.sqlContext.read.schema(KafkaRawEvent.getStructure()).json(parsedRDD);

  // Explode the events array into individual event rows and flatten the nested fields
  if (DFUtils.hasColumn(df, "events")) {
    // Rename the top-level dow and hour so they do not clash with event.dow / event.hour below
    if (DFUtils.hasColumn(df, "dow"))
      df = df.withColumnRenamed("dow", "hit-dow");
    if (DFUtils.hasColumn(df, "hour"))
      df = df.withColumnRenamed("hour", "hit-hour");

    df = df
      .withColumn("event", explode(col("events")))
      .drop("events");

    if (DFUtils.hasColumn(df, "event.key")) {
      df = df.select(
        "*", "event.key",
        "event.count", "event.hour",
        "event.dow", "event.sum",
        "event.timestamp",
        "event.segmentation");
    }

    if (DFUtils.hasColumn(df, "key")) {
      df = df.filter("key != '[CLY]_view'");
    }

    df = df.select("*", "segmentation.*")
      .drop("segmentation")
      .drop("event");

    if (DFUtils.hasColumn(df, "metrics")) {
      df = df.select("*", "metrics.*").drop("metrics");
    }

    df = df.withColumnRenamed("timestamp", "eventTimeString");
    df = df.withColumn("eventtimestamp", df("eventTimeString").cast(LongType).divide(1000).cast(TimestampType).cast(DateType))
      .withColumn("date", current_date());

    if (DFUtils.hasColumn(df, "restID")) {
      df = df.join(broadcast(restroCached), df.col("restID") === restro.col("main_r_id"), "left_outer");
    }

    val SAVE_PATH = Conf.getSavePath();

    // Write the dataframe to file
    df.write.partitionBy("date").mode("append").parquet(SAVE_PATH);

    // Filter out app launch events, keep the distinct adIds and push them to Kafka
    val columbiaDf = df.filter(col("adId").isNotNull)
      .select(col("adId")).distinct().toDF("cui").toJSON;

    // Push the columbia df to Kafka for further processing
    columbiaDf.foreachPartition(partitionOfRecords => {
      val factory = columbiaProducerPool.value;
      val producer = factory.getOrCreateProducer();
      partitionOfRecords.foreach(record => {
        producer.send(record);
      });
    });

    // Push every processed event to Kafka as JSON
    df.toJSON
      .foreachPartition(partitionOfRecords => {
        val factory = producerPool.value;
        val producer = factory.getOrCreateProducer();
        partitionOfRecords.foreach(record => {
          producer.send(record);
        });
      });
  }

  // Commit the consumed Kafka offsets and return the processed events as JSON strings
  rawStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges);
  df.toJSON.rdd;
});
val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60));

windowOneHourEveryMinute.foreachRDD(windowRDD => {
  val prefix = Conf.getAnalyticsPrefixesProperties().getProperty("view.all.last.3.hours");
  // Count RestaurantView events per restID over the window and push the counts to Redis
  val viewCount = spark.sqlContext.read.schema(ProcessedEvent.getStructure()).json(windowRDD)
    .filter("key == 'RestaurantView'")
    .groupBy("restID")
    .count()
    .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
  spark.sparkContext.toRedisKV(viewCount, Conf.getRedisKeysTTL());
});

streamingContext.start();
streamingContext.awaitTermination();
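KafkaRawEvent, ProcessedEvent, Conf, the producer pools and DFUtils are helpers from my project and are not shown here. DFUtils.hasColumn just checks whether a (possibly nested) column resolves on the dataframe; roughly it looks like the sketch below (the actual helper may differ slightly):

import scala.util.Try
import org.apache.spark.sql.DataFrame

object DFUtils {
  // Returns true if the (possibly nested) column path resolves on the dataframe.
  // Sketch only; the real helper in the job may differ.
  def hasColumn(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess
}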
This job had been running for almost a month without a single failure. Now the processing time has suddenly started to grow exponentially, even though no events are being processed.
I cannot figure out why this is happening. Below I have attached a screenshot of the Application Master.
Below is the graph of the processing time from the Jobs tab in the Spark UI. Most of the time is spent in .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1)))).
Below is the DAG of the stage.
This is only a small part of the DAG; the actual DAG is very large, but the same tasks are repeated once for every RDD in the window. I run 1-minute batches, so a 30-minute window ends up with 30 identical, repeated sets of tasks.
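As far as I understand it, window() simply unions the parent one-minute RDDs that fall inside the window, so every windowed batch carries the full Kafka -> JSON -> DataFrame lineage of each of those RDDs. The only mitigation I can think of is persisting and checkpointing the processed stream so that this lineage gets truncated, along the lines of the sketch below (not part of the running job; the interval and checkpoint path are just placeholders):

import org.apache.spark.storage.StorageLevel

// Sketch only, not part of the job above: persist the transformed stream so the
// window does not recompute the whole pipeline for every parent RDD, and
// checkpoint it to cut the ever-growing lineage.
streamingContext.checkpoint("hdfs:///tmp/stream-ckpt") // placeholder path
processedStream.persist(StorageLevel.MEMORY_AND_DISK_SER)
processedStream.checkpoint(Minutes(5)) // placeholder interval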
Is there any specific reason why the processing time would suddenly start growing exponentially?
Spark version: 2.2.0, Hadoop version: 2.7.3
Note: I am running this job on an EMR 5.8 cluster with one driver at 2.5 GB and one executor at 3.5 GB.
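For reference, this corresponds roughly to the following resource configuration (values are approximate; the real job sets them through spark-submit on EMR, not in code):

import org.apache.spark.SparkConf

// Approximate resource configuration of the job (sketch only; the app name is made up).
val sparkConf = new SparkConf()
  .setAppName("kafka-event-stream")
  .set("spark.driver.memory", "2500m")
  .set("spark.executor.memory", "3500m")
  .set("spark.executor.instances", "1")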