I'm running a Spark Streaming application with a 1-hour batch interval that joins two data feeds and writes the output to disk. The total size of one feed is about 40 GB per hour (split across multiple files), while the second feed is about 600-800 MB per hour (also split across multiple files). Due to application constraints, I may not be able to run smaller batches. At the moment it takes about 20 minutes to produce the output on a cluster with 140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, with 5 cores and 22 GB of RAM each.
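For completeness, the batch interval and executor resources are set up roughly as follows (a sketch: the app name is a placeholder, and spark.executor.memory / spark.executor.cores are the standard Spark properties for the per-executor values above):

SparkConf conf = new SparkConf()
        .setAppName("feed-join-app")          // placeholder name
        .set("spark.executor.memory", "22g")  // 22 GB per executor
        .set("spark.executor.cores", "5");    // 5 cores per executor, 28 executors in total
// 1-hour batch interval
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(3600));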
On the 40 GB feed I perform mapToPair(), filter() and reduceByKeyAndWindow() (1-hour batches). Most of the computation time is spent on these operations. What worries me is the garbage collection (GC) time per executor, which ranges from 25 seconds up to 9.2 minutes. I attached two screenshots: one lists the GC times and the other prints the GC comments for a single executor. I expect the executor that spends 9.2 minutes on GC will eventually be killed by the Spark driver.
I think these numbers are too high. Do you have any suggestions for keeping GC time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC and spark.rdd.compress=true.
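In configuration terms, those settings correspond to something like the following (spark.executor.extraJavaOptions being the usual way to pass the GC flag to the executor JVMs):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC");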
EDIT
Here is a snippet of my code:
// The data feed is then mapped to a key/value DStream. Some strings in the original stream will be filtered out according to the business logic
JavaPairDStream<String, String> filtered_data = orig_data.mapToPair(parserFunc)
.filter(new Function<scala.Tuple2<String, String>, Boolean>() {
@Override
public Boolean call(scala.Tuple2<String, String> t) {
return (!t._2().contains("null"));
}
});
// WINDOW_DURATION = 2 hours, SLIDE_DURATION = 1 hour. The data feed will be later joined with another feed.
// These two feeds are asynchronous: records in the second data feed may match records that appeared in the first data feed up to 2 hours before.
// I need to save RDDs of the first data feed because they may be joined later.
// I'm using reduceByKeyAndWindow() instead of window() because I can build this "cache" incrementally.
// For a given key, appendString() simply appends a new string to the value, while removeString() removes the strings (i.e. parts of the value) that go out of scope (i.e. fall outside WINDOW_DURATION); see the sketch below
JavaPairDStream<String, String> windowed_data = filtered_data.reduceByKeyAndWindow(appendString, removeString, Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION))
.flatMapValues(new Function<String, Iterable<String>>() {
@Override
public Iterable<String> call(String s) {
return Arrays.asList(s.split(","));
}
});
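// For reference, appendString and removeString are the reduce / inverse-reduce pair used above
// (defined before the call in the real code). This is a simplified sketch that treats each value
// as a comma-separated list of strings; it assumes java.util.* and
// org.apache.spark.api.java.function.Function2 are imported.
Function2<String, String, String> appendString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String next) {
        // Append the newly arrived string(s) to the accumulated comma-separated value
        return acc + "," + next;
    }
};
Function2<String, String, String> removeString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String leaving) {
        // Drop the strings that have fallen out of the 2-hour window from the accumulated value
        // (simplified: assumes each string occurs at most once per key)
        List<String> kept = new ArrayList<String>(Arrays.asList(acc.split(",")));
        kept.removeAll(Arrays.asList(leaving.split(",")));
        StringBuilder sb = new StringBuilder();
        for (String s : kept) {
            if (sb.length() > 0) sb.append(",");
            sb.append(s);
        }
        return sb.toString();
    }
};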
// This is a second data feed, which is also transformed to a K/V RDD for the join operation with the first feed
JavaDStream<String> second_stream = jssc.textFileStream(MSP_DIR);
JavaPairDStream<String, String> ss_kv = second_stream.mapToPair(new PairFunction<String, String, String>() {
@Override
public scala.Tuple2<String, String> call(String row) {
String[] el = row.split("\\|");
return new scala.Tuple2<String, String>(el[9], row);
}
});
JavaPairDStream<String, scala.Tuple2<String, String>> joined_stream = ss_kv.join(windowed_data);
// Use foreachRDD() to save joined_stream to HDFS
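// A sketch of that save step: OUTPUT_DIR is a placeholder for the actual HDFS output path,
// and this uses the Spark 1.x foreachRDD(Function<..., Void>) signature
// (assumes org.apache.spark.api.java.JavaPairRDD is imported).
joined_stream.foreachRDD(new Function<JavaPairRDD<String, scala.Tuple2<String, String>>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, scala.Tuple2<String, String>> rdd) {
        // Write one output directory per batch, named after the wall-clock time
        rdd.saveAsTextFile(OUTPUT_DIR + "/" + System.currentTimeMillis());
        return null;
    }
});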