I'm using Spark 2.1.1. I perform many unions and selects on an input Dataset (inputDs) inside a loop, which looks like this:
val myDs = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getDsForOneHour(inputDs, next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
}).reduce(_.union(_))
def getDsForOneHour(ds: Dataset[I], year: Int, month: Int, day: Int, hour: Int)(implicit sql: SQLImplicits): Dataset[I] = {
  ds.where(col("year") === year and col("month") === month and col("day") === day and col("hour") === hour)
}
I run this code with spark-testing-base, and it takes about 3 minutes to process one month (roughly 30 * 24 unions and selects). These are all lazy operations, so I'd like to know why building myDs takes so much time.
Answer (score: 1)
My guess is that it is slow because the execution plan gets re-analyzed for every new dataset unioned in the loop. You can rewrite the code so that it builds a single filter first:
def getFilterForOneHour(year: Int, month: Int, day: Int, hour: Int): Column = {
  col("year") === year and col("month") === month and col("day") === day and col("hour") === hour
}
val myFilter = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getFilterForOneHour(next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
}).reduce(_ or _)

val myDs = inputDs.where(myFilter)
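To verify that this rewrite produces one scan with a single combined predicate, rather than a deep chain of unions, you can compare the plans of the two variants. A minimal sketch, using nothing beyond the standard Dataset.explain call:

// Print the physical plan of the filter-based variant; with the combined
// predicate you should see a single Filter over inputDs, instead of one
// Union node per hour as in the original loop.
myDs.explain()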
EDIT: You can also try doing a grouped union (with a batch size of 50 in my case). I ran some tests with a dummy in-memory dataset, and in my case this improved performance by a factor of 8:
val myDs = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getDsForOneHour(inputDs, next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
})
  .grouped(50).map(dss => dss.reduce(_ union _))
  .reduce(_ union _)
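If you need this pattern in more than one place, the batched union can be factored out into a small helper. This is only a sketch of the same technique; unionInBatches is not part of the original answer, just an illustration:

import org.apache.spark.sql.Dataset

// Unions the datasets in batches of `batchSize` so the union tree stays
// shallow, then unions the per-batch results; equivalent to the grouped
// reduce used above. Assumes the iterator is non-empty.
def unionInBatches[T](dss: Iterator[Dataset[T]], batchSize: Int = 50): Dataset[T] =
  dss.grouped(batchSize).map(_.reduce(_ union _)).reduce(_ union _)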