I'm running a Spark Streaming application with a 1-hour batch interval that joins two data feeds and writes the output to disk. The total size of one feed is about 40 GB per hour (split across multiple files), while the second feed is about 600-800 MB per hour (also split across multiple files). Due to application constraints, I may not be able to run smaller batches. At the moment it takes about 20 minutes to produce the output on a cluster with 140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, with 5 cores and 22 GB of RAM each.
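For completeness, the batch interval and executor resources are set up roughly as follows (a sketch: the app name is a placeholder, and spark.executor.memory / spark.executor.cores are the standard Spark properties for the per-executor values above):

SparkConf conf = new SparkConf()
        .setAppName("feed-join-app")          // placeholder name
        .set("spark.executor.memory", "22g")  // 22 GB per executor
        .set("spark.executor.cores", "5");    // 5 cores per executor, 28 executors in total
// 1-hour batch interval
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(3600));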
On the 40 GB feed I perform mapToPair(), filter() and reduceByKeyAndWindow() (1-hour batches). Most of the computation time is spent on these operations. What worries me is the garbage collection (GC) time per executor, which ranges from 25 seconds up to 9.2 minutes. I attached two screenshots: one lists the GC times and the other prints the GC comments for a single executor. I expect the executor that spends 9.2 minutes on GC will eventually be killed by the Spark driver.
I think these numbers are too high. Do you have any suggestions for keeping GC time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC and spark.rdd.compress=true.
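In configuration terms, those settings correspond to something like the following (spark.executor.extraJavaOptions being the usual way to pass the GC flag to the executor JVMs):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC");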
EDIT
Here is a snippet of my code:
// The data feed is then mapped to a key/value DStream. Some strings in the original stream will be filtered out according to the business logic
JavaPairDStream<String, String> filtered_data = orig_data.mapToPair(parserFunc)
.filter(new Function<scala.Tuple2<String, String>, Boolean>() {
@Override
public Boolean call(scala.Tuple2<String, String> t) {
return (!t._2().contains("null"));
}
});
// WINDOW_DURATION = 2 hours, SLIDE_DURATION = 1 hour. The data feed will be later joined with another feed.
// These two feeds are asynchronous: records in the second data feed may match records that appeared in the first data feed up to 2 hours before.
// I need to save RDDs of the first data feed because they may be joined later.
// I'm using reduceByKeyAndWindow() instead of window() because I can build this "cache" incrementally.
// For a given key, appendString() simply appends a new string to the value, while removeString() removes the strings (i.e. parts of the value) that go out of scope (i.e. fall outside WINDOW_DURATION); see the sketch below
JavaPairDStream<String, String> windowed_data = filtered_data.reduceByKeyAndWindow(appendString, removeString, Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION))
.flatMapValues(new Function<String, Iterable<String>>() {
@Override
public Iterable<String> call(String s) {
return Arrays.asList(s.split(","));
}
});
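// For reference, appendString and removeString are the reduce / inverse-reduce pair used above
// (defined before the call in the real code). This is a simplified sketch that treats each value
// as a comma-separated list of strings; it assumes java.util.* and
// org.apache.spark.api.java.function.Function2 are imported.
Function2<String, String, String> appendString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String next) {
        // Append the newly arrived string(s) to the accumulated comma-separated value
        return acc + "," + next;
    }
};
Function2<String, String, String> removeString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String leaving) {
        // Drop the strings that have fallen out of the 2-hour window from the accumulated value
        // (simplified: assumes each string occurs at most once per key)
        List<String> kept = new ArrayList<String>(Arrays.asList(acc.split(",")));
        kept.removeAll(Arrays.asList(leaving.split(",")));
        StringBuilder sb = new StringBuilder();
        for (String s : kept) {
            if (sb.length() > 0) sb.append(",");
            sb.append(s);
        }
        return sb.toString();
    }
};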
// This is a second data feed, which is also transformed to a K/V RDD for the join operation with the first feed
JavaDStream<String> second_stream = jssc.textFileStream(MSP_DIR);
JavaPairDStream<String, String> ss_kv = second_stream.mapToPair(new PairFunction<String, String, String>() {
@Override
public scala.Tuple2<String, String> call(String row) {
String[] el = row.split("\\|");
return new scala.Tuple2<String, String>(el[9], row);
}
});
JavaPairDStream<String, scala.Tuple2<String, String>> joined_stream = ss_kv.join(windowed_data);
// Use foreachRDD() to save joined_stream to HDFS
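// A sketch of that save step: OUTPUT_DIR is a placeholder for the actual HDFS output path,
// and this uses the Spark 1.x foreachRDD(Function<..., Void>) signature
// (assumes org.apache.spark.api.java.JavaPairRDD is imported).
joined_stream.foreachRDD(new Function<JavaPairRDD<String, scala.Tuple2<String, String>>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, scala.Tuple2<String, String>> rdd) {
        // Write one output directory per batch, named after the wall-clock time
        rdd.saveAsTextFile(OUTPUT_DIR + "/" + System.currentTimeMillis());
        return null;
    }
});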