Spark Kafka stream processing time increasing exponentially

Time: 2017-09-05 17:59:03

Tags: hadoop apache-spark apache-kafka spark-streaming emr

I have a Spark Kafka streaming job. Below is the main processing logic of the job.

val processedStream = rawStream.transform(x => {
var offsetRanges = x.asInstanceOf[HasOffsetRanges].offsetRanges;
val spark = SparkSession.builder.config(x.sparkContext.getConf).getOrCreate();
val parsedRDD = x.map(cr => cr.value());

var df = spark.sqlContext.read.schema(KafkaRawEvent.getStructure()).json(parsedRDD);

// Explode Events array as individual Event
if (DFUtils.hasColumn(df, "events")) {
    // Rename the dow and hour
    if (DFUtils.hasColumn(df, "dow"))
        df = df.withColumnRenamed("dow", "hit-dow");
    if (DFUtils.hasColumn(df, "hour"))
        df = df.withColumnRenamed("hour", "hit-hour");

    df = df
        .withColumn("event", explode(col("events")))
        .drop("events");

    if (DFUtils.hasColumn(df, "event.key")) {
        df = df.select(
            "*", "event.key",
            "event.count", "event.hour",
            "event.dow", "event.sum",
            "event.timestamp",
            "event.segmentation");
    }

    if (DFUtils.hasColumn(df, "key")) {
        df = df.filter("key != '[CLY]_view'");
    }

    df = df.select("*", "segmentation.*")
        .drop("segmentation")
        .drop("event");

    if (DFUtils.hasColumn(df, "metrics")) {
        df = df.select("*", "metrics.*").drop("metrics");
    }

    df = df.withColumnRenamed("timestamp", "eventTimeString");
    df = df.withColumn("eventtimestamp", df("eventTimeString").cast(LongType).divide(1000).cast(TimestampType).cast(DateType))
        .withColumn("date", current_date());


    if (DFUtils.hasColumn(df, "restID")) {
        df = df.join(broadcast(restroCached), df.col("restID") === restro.col("main_r_id"), "left_outer");
    }

    val SAVE_PATH = Conf.getSavePath();

    //Write dataframe to file
    df.write.partitionBy("date").mode("append").parquet(SAVE_PATH);

    // filter out app launch events and group by adId and push to kafka
    val columbiaDf = df.filter(col("adId").isNotNull)
        .select(col("adId")).distinct().toDF("cui").toJSON;

    // push columbia df to kafka for further processing
    columbiaDf.foreachPartition(partitionOfRecords => {
        val factory = columbiaProducerPool.value;
        val producer = factory.getOrCreateProducer();
        partitionOfRecords.foreach(record => {
            producer.send(record);
        });
    });

    df.toJSON
        .foreachPartition(partitionOfRecords => {
            val factory = producerPool.value;
            val producer = factory.getOrCreateProducer();
            partitionOfRecords.foreach(record => {
                producer.send(record);
            });
        });
}
rawStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges);
df.toJSON.rdd;
});

val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60));

windowOneHourEveryMinute.foreachRDD(windowRDD => ({
val prefix = Conf.getAnalyticsPrefixesProperties().getProperty("view.all.last.3.hours");

val viewCount = spark.sqlContext.read.schema(ProcessedEvent.getStructure()).json(windowRDD)
    .filter("key == 'RestaurantView'")
    .groupBy("restID")
    .count()
    .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
spark.sparkContext.toRedisKV(viewCount, Conf.getRedisKeysTTL());
}));

streamingContext.start();
streamingContext.awaitTermination();

The job had been running for almost a month without a single failure. Now, suddenly, the processing time has started to grow exponentially, even though no events are being processed.

I cannot figure out why this is happening. Below I have attached screenshots from the Application Master.

[Screenshot: Batches]

Below is the graph of the processing time:

[Screenshot: processing-time graph]

From the Jobs tab in the Spark UI, most of the time is spent on the line .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));

[Screenshot: Jobs tab]

Below is the DAG for the stage:

[Screenshot: stage DAG]

This is only a small part of the DAG. The actual DAG is very large, but the same set of tasks is repeated once for every RDD in the window: I run batches of 1 minute, so a 30-minute window contains 30 identical, repeated sets of tasks.
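
For reference, here is a minimal sketch of how the stream returned by transform could be persisted and periodically checkpointed so that each batch's result is materialized once instead of having its whole lineage recomputed for every window evaluation. The processedStream and streamingContext names are reused from the code above, the checkpoint directory is a hypothetical placeholder, and this only illustrates the standard DStream persistence/checkpoint API; it is not a confirmed fix for the slowdown.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Minutes

// Sketch only (both calls go before streamingContext.start()).
// Persist each batch's result so that window() can reuse the stored blocks
// instead of re-running the Kafka read / JSON parse / explode / join / toJSON
// lineage for every parent RDD that falls inside the window.
processedStream.persist(StorageLevel.MEMORY_AND_DISK_SER);

// Optional: periodic checkpointing cuts the lineage of the persisted RDDs;
// it requires a checkpoint directory on the streaming context.
streamingContext.checkpoint("hdfs:///checkpoints/kafka-stream");   // hypothetical path
processedStream.checkpoint(Minutes(10));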

Is there any specific reason why the processing time would suddenly start to grow exponentially?

Spark version: 2.2.0, Hadoop version: 2.7.3

Note: I am running this job on an EMR 5.8 cluster with 1 driver of 2.5 GB and 1 executor of 3.5 GB.
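
For completeness, the sizing above could be sketched in code as follows; the exact property values are my assumption from the figures quoted, and on EMR these settings are normally supplied through spark-submit or the cluster/step configuration rather than set programmatically.

import org.apache.spark.SparkConf

// Illustrative sketch of the resources described above (values assumed).
// spark.driver.memory only takes effect if set before the driver JVM starts,
// e.g. via spark-submit; it is included here just to document the value.
val conf = new SparkConf()
    .set("spark.executor.instances", "1")
    .set("spark.executor.memory", "3500m")   // ~3.5 GB executor
    .set("spark.driver.memory", "2500m");    // ~2.5 GB driver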

0 Answers:

No answers yet.