So I have an Apache Spark stream that writes S3 parquet files, partitioned by day and hour, every 20 minutes. It seems that before each batch write, it performs an "ls" and "head" on all folders under this table's (root folder's) name.
Since we have multiple days x 24 hours across several different tables, overall this adds up to a relatively high S3 cost.
Note that our schema is changing dynamically.
So my questions are:
Is it correct that the write recursively reads all the parquet headers?
Why doesn't the stream cache this information / can it be cached?
Can you suggest a best practice?
// Write code:
withPartition.write()
.format(format)
.mode(SaveMode.Append)
.partitionBy("day","hour")
.save(path);
This question seems to be related to:
Answer 0: (score: 0)
I found that Spark's partitionBy caused this problem:
Spark partitionBy much slower than without it
So I implemented it as follows, which solved the problem and also improved performance:
withPartition = withPartition.persist(StorageLevel.MEMORY_AND_DISK());
Dataset<DayAndHour> daysAndHours = withPartition.map(mapToDayHour(), Encoders.bean(DayAndHour.class)).distinct();
DayAndHour[] collect = (DayAndHour[]) daysAndHours.collect();
Arrays.sort(collect);
logger.info("found " + collect.length + " different days and hours: "
        + Arrays.stream(collect).map(DayAndHour::toString).collect(Collectors.joining(",")));
long time = System.currentTimeMillis();
long cnt = 0;
for (DayAndHour dayAndHour : collect) {
    int day = dayAndHour.getDay();
    int hour = dayAndHour.getHour();
    logger.info("Start filter on " + dayAndHour);
    Dataset<Row> filtered = withPartition.filter(filterDayAndHour(day, hour))
            .drop("day", "hour");
    String newPath = path + "/"
            + "day" + "=" + day + "/"
            + "hour" + "=" + hour;
    long specificPathCount = filtered.count();
    long timeStart = System.currentTimeMillis();
    logger.info("writing " + specificPathCount + " events to " + newPath);
    filtered.write()
            .format(format)
            .mode(SaveMode.Append)
            .save(newPath);
    cnt += specificPathCount;
    logger.info("Finish writing partition of " + dayAndHour + " to " + newPath + ". Wrote [" + specificPathCount + "] events in "
            + TimeUtils.tookMinuteSecondsAndMillis(timeStart, System.currentTimeMillis()));
}
logger.info("Finish writing " + path + ". Wrote [" + cnt + "] events in "
        + TimeUtils.tookMinuteSecondsAndMillis(time, System.currentTimeMillis()));
withPartition.unpersist();
private static MapFunction<Row, DayAndHour> mapToDayHour() {
    return new MapFunction<Row, DayAndHour>() {
        @Override
        public DayAndHour call(Row value) throws Exception {
            int day = value.getAs("day");
            int hour = value.getAs("hour");
            DayAndHour dayAndHour = new DayAndHour();
            dayAndHour.setDay(day);
            dayAndHour.setHour(hour);
            return dayAndHour;
        }
    };
}
private static FilterFunction<Row> filterDayAndHour(int day, int hour) {
    return new FilterFunction<Row>() {
        @Override
        public boolean call(Row value) throws Exception {
            int cDay = value.getAs("day");
            int cHour = value.getAs("hour");
            return day == cDay && hour == cHour;
        }
    };
}
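Note that the explicit paths built in the loop follow the Hive-style `day=<d>/hour=<h>` layout, so Spark's partition discovery can still recover `day` and `hour` as columns when the root path is read back. A minimal sketch of that path construction (the class name and the `s3://bucket/table` base path are placeholders of mine, not from the original code):

```java
public class PartitionPathDemo {
    // Mirrors the newPath string built in the write loop above.
    static String partitionPath(String basePath, int day, int hour) {
        return basePath + "/day=" + day + "/hour=" + hour;
    }

    public static void main(String[] args) {
        // Placeholder base path, for illustration only.
        System.out.println(partitionPath("s3://bucket/table", 3, 14));
        // prints s3://bucket/table/day=3/hour=14
    }
}
```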
// And another POJO:
public class DayAndHour implements Serializable, Comparable<DayAndHour> {
    private int day;
    private int hour;

    public int getDay() {
        return day;
    }

    public void setDay(int day) {
        this.day = day;
    }

    public int getHour() {
        return hour;
    }

    public void setHour(int hour) {
        this.hour = hour;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        DayAndHour that = (DayAndHour) o;
        if (day != that.day) return false;
        return hour == that.hour;
    }

    @Override
    public int hashCode() {
        int result = day;
        result = 31 * result + hour;
        return result;
    }

    @Override
    public String toString() {
        return "(" +
                "day=" + day +
                ", hour=" + hour +
                ')';
    }

    @Override
    public int compareTo(DayAndHour dayAndHour) {
        return Integer.compare((day * 100) + hour, (dayAndHour.day * 100) + dayAndHour.hour);
    }
}
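The compareTo above encodes each (day, hour) pair as the single integer `day * 100 + hour`, which sorts correctly as long as hour stays below 100 (true for the 0-23 range used here). A small standalone check of that ordering (the class name is mine, for illustration):

```java
import java.util.Arrays;

public class DayHourOrderDemo {
    // Same composite key as DayAndHour.compareTo; valid while hour < 100.
    static int key(int day, int hour) {
        return day * 100 + hour;
    }

    public static void main(String[] args) {
        int[] keys = {key(2, 5), key(1, 23), key(2, 0)};
        Arrays.sort(keys);
        // day=1/hour=23 sorts before both day=2 entries.
        System.out.println(Arrays.toString(keys)); // prints [123, 200, 205]
    }
}
```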