Understanding Apache Spark filter transformation behavior

Time: 2016-06-13 15:46:32

Tags: java apache-spark filter

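I have a list of items in a JavaRDD, where each item is a date (a Java Calendar). I want to filter out all the dates that are earlier than a given date. Here is my code: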

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Date comparison test")
            .setMaster("local[4]").set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // initializes a filter date to 01/01/2016 at 10:00:00
        Calendar filterDate = Calendar.getInstance();
        filterDate.clear();
        filterDate.setTimeInMillis(1451642400000L);

        // initializes an array of 40 calendars, in which every date
        // is 1 hour later than the previous, starting from
        // 01/01/2016 at 08:00:00
        ArrayList<Calendar> calendarArray = new ArrayList<>();
        // milliseconds corresponding to 01/01/2016 at 08:00:00
        long initial = 1451635200000L;
        for (int i = 0; i < 40; ++i) {
            Calendar one = Calendar.getInstance();
            one.clear();
            one.setTimeInMillis(initial);
            calendarArray.add(one);
            initial += 3600000;
        }

        JavaRDD<Calendar> rdd = sc.parallelize(calendarArray);
        JavaRDD<Calendar> rddFiltered = rdd.filter(new FilterTest(filterDate));
        System.out.println("RDD SIZE " + rddFiltered.count());
        sc.close();
    }

The FilterTest code:

    import java.util.Calendar;
    import org.apache.spark.api.java.function.Function;

    public class FilterTest implements Function<Calendar, Boolean> {

        private static final long serialVersionUID = -3134317182912968444L;
        private final Calendar filteringDate;

        public FilterTest(Calendar filteringDate) {
            super();
            this.filteringDate = filteringDate;
        }

        @Override
        public Boolean call(Calendar arg0) throws Exception {
            // getStandardFormattedDate just prints a date in a given format
            System.out.println(TimeUtils.getStandardFormattedDate(arg0) + " - "
                + TimeUtils.getStandardFormattedDate(filteringDate));
            // keep only the dates that are not before the filtering date
            return !arg0.before(filteringDate);
        }
    }
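To check the predicate on its own, the same comparison can be run without Spark. This is only a sketch: the class name `FilterPredicateCheck` and its helper methods are made up for illustration, and the calendars are pinned to UTC as an assumption (the zone does not affect `Calendar.before`, which compares the underlying milliseconds).

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;
import java.util.TimeZone;

// Hypothetical helper class, not part of the original program.
public class FilterPredicateCheck {

    // Same rule as FilterTest.call: keep dates that are NOT before the filter date.
    static boolean keep(Calendar candidate, Calendar filterDate) {
        return !candidate.before(filterDate);
    }

    // Builds `count` calendars one hour apart, starting at `startMillis`.
    static List<Calendar> hourlyCalendars(long startMillis, int count) {
        List<Calendar> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Calendar c = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            c.setTimeInMillis(startMillis + i * 3_600_000L);
            result.add(c);
        }
        return result;
    }

    // Counts how many of the 40 hourly dates survive the filter.
    static long countKept() {
        Calendar filterDate = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        filterDate.setTimeInMillis(1451642400000L); // 01/01/2016 10:00:00 UTC
        return hourlyCalendars(1451635200000L, 40).stream() // starts at 08:00:00 UTC
                .filter(c -> keep(c, filterDate))
                .count();
    }

    public static void main(String[] args) {
        System.out.println("KEPT " + countKept()); // prints "KEPT 38"
    }
}
```

Only the 08:00 and 09:00 entries are before the filter date, so 38 of the 40 dates are kept. That is the value `rddFiltered.count()` should report if the filtering itself is behaving.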

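What I can't really understand is the output I get. It looks as if the fixed Calendar that I pass as the parameter to compare against sometimes changes (for example, when it shows Sat, 01 Jan 2016 22:00:00).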

Output:

Sat, 01 Jan 2016 08:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 01 Jan 2016 08:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 08:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 15:00:00 - Fri, 01 Jan 2016 09:00:00
Fri, 01 Jan 2016 09:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 15:00:00 - Fri, 01 Jan 2016 09:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 20:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 02 Jan 2016 07:00:00
Sat, 02 Jan 2016 17:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 11:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 21:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Sat, 01 Jan 2016 22:00:00
Fri, 01 Jan 2016 22:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 01 Jan 2016 22:00:00 - Fri, 01 Jan 2016 12:00:00
Fri, 01 Jan 2016 23:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 23:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 23:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 02 Jan 2016 00:00:00
Sat, 02 Jan 2016 00:00:00 - Fri, 01 Jan 2016 19:00:00
Fri, 01 Jan 2016 13:00:00 - Fri, 02 Jan 2016 10:00:00
Sat, 01 Jan 2016 10:00:00 - Sat, 01 Jan 2016 14:00:00
Sat, 02 Jan 2016 01:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 02 Jan 2016 11:00:00
Sat, 01 Jan 2016 14:00:00 - Fri, 02 Jan 2016 11:00:00
Sat, 02 Jan 2016 11:00:00 - Fri, 01 Jan 2016 15:00:00
Fri, 01 Jan 2016 15:00:00 - Sat, 02 Jan 2016 10:00:00
Sat, 02 Jan 2016 02:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 15:00:00 - Sat, 02 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Fri, 01 Jan 2016 12:00:00 - Fri, 01 Jan 2016 22:00:00
Sat, 01 Jan 2016 17:00:00 - Fri, 02 Jan 2016 10:00:00
Fri, 01 Jan 2016 22:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 13:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 01 Jan 2016 10:00:00 - Fri, 01 Jan 2016 10:00:00
Sat, 02 Jan 2016 23:00:00 - Fri, 01 Jan 2016 10:00:00

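What exactly happens to that variable while the computation is being distributed? I'm asking because the result is apparently correct, but I'm having trouble debugging this code in a more complex case.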
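One possible explanation, stated purely as an assumption because `TimeUtils` is not shown: if `getStandardFormattedDate` reuses a single static `SimpleDateFormat`, the four `local[4]` worker threads would share it. `SimpleDateFormat` is documented as not synchronized, so concurrent calls can mix fields from different dates in the printed string (for instance a weekday that does not match the day) while the `Calendar` values and the count stay correct. A minimal sketch of a per-thread formatter follows. The class name `SafeDateFormatting`, the pattern, and the UTC zone are guesses based on the printed output, not the actual helper.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical replacement for the unseen TimeUtils helper.
public class SafeDateFormatting {

    // Assumed pattern, inferred from lines such as "Fri, 01 Jan 2016 10:00:00".
    private static final String PATTERN = "EEE, dd MMM yyyy HH:mm:ss";

    // One formatter per thread, so no SimpleDateFormat instance is shared.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
            f.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed zone
            return f;
        });

    // Formats the calendar's instant with the calling thread's own formatter.
    static String format(Calendar calendar) {
        return FORMAT.get().format(calendar.getTime());
    }

    public static void main(String[] args) {
        Calendar filterDate = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        filterDate.setTimeInMillis(1451642400000L);
        System.out.println(format(filterDate)); // prints "Fri, 01 Jan 2016 10:00:00"
    }
}
```

If the garbled lines disappear after a change like this, the filter date was never modified and only the logging was unreliable. If they persist, the cause lies elsewhere.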

0 answers:

No answers