Spark writing files and appending to S3 - cost issue

Time: 2017-08-31 12:48:22

Tags: apache-spark amazon-s3 spark-streaming

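I have an Apache Spark stream that writes Parquet files to S3 every 20 minutes, partitioned by day and hour. It seems that before each batch write, it performs an "ls" and a "head" on every folder under this table's root folder.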

Since we have many days x 24 hours x several different tables, this adds up to a relatively high S3 cost overall.
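To get a feel for how this adds up, here is a rough back-of-the-envelope sketch. It assumes exactly one "ls" and one "head" request per existing day/hour folder per batch, which is a simplification: the real request count depends on the Spark and Hadoop S3 connector versions. The class name `S3RequestEstimate` and the workload numbers (10 tables, 30 days of data) are hypothetical; only the 20-minute batch interval comes from the setup described above.

```java
// Rough estimate of S3 LIST/HEAD requests caused by the append, under stated assumptions.
final class S3RequestEstimate {

    /** Number of batches per day for a given batch interval in minutes. */
    static long batchesPerDay(int batchIntervalMinutes) {
        return (24L * 60L) / batchIntervalMinutes;
    }

    /**
     * Assumption: every batch issues one "ls" and one "head" per existing
     * day/hour folder of every table.
     */
    static long requestsPerDay(int tables, int retainedDays, int batchIntervalMinutes) {
        long foldersPerTable = retainedDays * 24L;               // one folder per day and hour
        long requestsPerBatch = 2L * tables * foldersPerTable;   // ls + head per folder
        return requestsPerBatch * batchesPerDay(batchIntervalMinutes);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 10 tables, 30 days of data, 20-minute batches.
        System.out.println(requestsPerDay(10, 30, 20));
    }
}
```

Under these assumptions the number of requests grows linearly with the number of folders already written, so each batch becomes more expensive as the table ages.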

Please note that our schema changes dynamically.

So my questions are:

  1. Is it correct that the write recursively reads all the Parquet heads?

  2. Why doesn't the stream cache this information, and is it possible to cache it?

  3. Can you suggest best practices?

// The write code:

      withPartition.write()
                    .format(format)
                    .mode(SaveMode.Append)
                    .partitionBy("day","hour")
                    .save(path);
    

    This issue seems to be related to:

    https://issues.apache.org/jira/browse/SPARK-20049

    Spark partitionBy much slower than without it

1 Answer:

Answer 0: (score: 0)

I found that Spark's partitionBy is what causes this problem:

Spark partitionBy much slower than without it

So I implemented the write as follows, which solved the problem and also improved performance:

    // Cache the batch: it is scanned once per (day, hour) combination below.
    withPartition = withPartition.persist(StorageLevel.MEMORY_AND_DISK());
    Dataset<DayAndHour> daysAndHours = withPartition.map(mapToDayHour(), Encoders.bean(DayAndHour.class)).distinct();

    DayAndHour[] collect = (DayAndHour[]) daysAndHours.collect();
    Arrays.sort(collect);
    logger.info("found " + collect.length + " different days and hours: "
            + Arrays.stream(collect).map(DayAndHour::toString).collect(Collectors.joining(",")));
    long time = System.currentTimeMillis();
    long cnt = 0; // total number of events written across all partitions
    for (DayAndHour dayAndHour : collect) {
        int day = dayAndHour.getDay();
        int hour = dayAndHour.getHour();
        logger.info("Start filter on " + dayAndHour);
        // The partition values are encoded in the path, so drop them from the data.
        Dataset<Row> filtered = withPartition.filter(filterDayAndHour(day, hour))
                .drop("day", "hour");

        // Write straight into the partition directory instead of using partitionBy.
        String newPath = path + "/"
                + "day" + "=" + day + "/"
                + "hour" + "=" + hour;

        long specificPathCount = filtered.count();
        cnt += specificPathCount;
        long timeStart = System.currentTimeMillis();
        logger.info("writing " + specificPathCount + " event to " + newPath);

        filtered.write()
                .format(format)
                .mode(SaveMode.Append)
                .save(newPath);

        // TimeUtils is a custom helper class (not shown here).
        logger.info("Finish writing partition of " + dayAndHour + " to " + newPath + ". Wrote [" + specificPathCount + "] events  in " + TimeUtils.tookMinuteSecondsAndMillis(timeStart, System.currentTimeMillis()));
    }
    logger.info("Finish writing " + path + ". Wrote [" + cnt + "] events  in " + TimeUtils.tookMinuteSecondsAndMillis(time, System.currentTimeMillis()));
    withPartition.unpersist();

private static  MapFunction<Row, DayAndHour> mapToDayHour() {
    return new MapFunction<Row, DayAndHour>() {
        @Override
        public DayAndHour call(Row value) throws Exception {
            int day = value.getAs("day");
            int hour = value.getAs("hour");
            DayAndHour dayAndHour = new DayAndHour();
            dayAndHour.setDay(day);
            dayAndHour.setHour(hour);
            return dayAndHour;
        }
    };
}

private static  FilterFunction<Row> filterDayAndHour(int day, int hour) {
    return new FilterFunction<Row>() {
        @Override
        public boolean call(Row value) throws Exception {
            int cDay = value.getAs("day");
            int cHour = value.getAs("hour");

            return day == cDay && hour == cHour;
        }
    };
}

// And the accompanying POJO:

public class DayAndHour implements Serializable , Comparable<DayAndHour>{

    private int day;
    private int hour;

    public int getDay() {
        return day;
    }

    public void setDay(int day) {
        this.day = day;
    }

    public int getHour() {
        return hour;
    }

    public void setHour(int hour) {
        this.hour = hour;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        DayAndHour that = (DayAndHour) o;

        if (day != that.day) return false;
        return hour == that.hour;
    }

    @Override
    public int hashCode() {
        int result = day;
        result = 31 * result + hour;
        return result;
    }

    @Override
    public String toString() {
        return "(" +
                "day=" + day +
                ", hour=" + hour +
                ')';
    }

    @Override
    public int compareTo(DayAndHour dayAndHour) {
        return Integer.compare((day * 100) + hour, (dayAndHour.day * 100) + dayAndHour.hour);
    }
}
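
The workaround relies on each (day, hour) group landing in the same directory layout that `partitionBy("day", "hour")` would produce, so that the `day` and `hour` columns dropped before the write can still be recovered from the directory names when the table root is read back. A minimal, Spark-free sketch of that path convention and of the ordering used by `compareTo` follows. The class name `PartitionPaths`, the bucket path, and the sample values are hypothetical.

```java
// Sketch of the Hive-style partition path and sort key used by the workaround.
final class PartitionPaths {

    /** Builds "<root>/day=<day>/hour=<hour>", matching the layout of partitionBy("day", "hour"). */
    static String partitionPath(String root, int day, int hour) {
        return root + "/day=" + day + "/hour=" + hour;
    }

    /** Same ordering key as DayAndHour.compareTo; only valid while hour stays below 100. */
    static int sortKey(int day, int hour) {
        return day * 100 + hour;
    }

    public static void main(String[] args) {
        String root = "s3a://example-bucket/events"; // hypothetical table root
        System.out.println(partitionPath(root, 243, 12));
        // Hour 23 of one day sorts before hour 0 of the next day.
        System.out.println(sortKey(243, 23) < sortKey(244, 0));
    }
}
```

Because every write targets a single leaf directory, the append only needs to look at that one folder rather than at the whole table root.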