Question

I am using Spark 2.1.1. I do many unions and selects on an input Dataset (inputDs) in a loop, and it looks like this:

val myDs = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getDsForOneHour(inputDs, next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
}).reduce(_.union(_))

def getDsForOneHour(ds: Dataset[I], year: Int, month: Int, day: Int, hour: Int)(implicit sql: SQLImplicits): Dataset[I] = {
  ds.where(col("year") === year and col("month") === month and col("day") === day and col("hour") === hour)
}

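I run this code with spark-testing-base, and it takes about 3 minutes to finish for one month of data (roughly 30 * 24 unions and selects). These are all lazy operations, so I am wondering why it takes so long just to build myDs?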

Answer 1

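My guess is that it is slow because the execution plan is updated for every new Dataset that gets unioned in the loop. You could rewrite your code to build the filter first: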

def getFilterForOneHour(year: Int, month: Int, day: Int, hour: Int): Column = {
  col("year") === year and col("month") === month and col("day") === day and col("hour") === hour
}

val myFilter = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getFilterForOneHour(next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
}).reduce(_ or _)

val myDs = inputDs.where(myFilter)
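
One rough way to check this guess is to time only the construction of the Datasets, without running any action. The sketch below is just an illustration under my own assumptions: the Event case class, the timed helper, the local[*] session and the fixed June 2017 range are made up for the example and are not from the question. Timings will depend on your machine and Spark version, so no numbers are claimed here.

```scala
import org.apache.spark.sql.{Column, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

object PlanBuildTiming {
  // Hypothetical row type standing in for the question's I.
  case class Event(year: Int, month: Int, day: Int, hour: Int)

  // Runs a block and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("plan-build-timing").getOrCreate()
    import spark.implicits._

    val inputDs: Dataset[Event] = Seq(Event(2017, 6, 1, 0), Event(2017, 6, 1, 5)).toDS()

    // One (day, hour) pair per hour; lower `days` for a quicker run.
    val days = 30
    val hours = for (d <- 1 to days; h <- 0 until 24) yield (d, h)

    def hourFilter(d: Int, h: Int): Column =
      col("year") === 2017 && col("month") === 6 && col("day") === d && col("hour") === h

    // No action is triggered below; only the Dataset objects (and their plans) are built.
    val (_, unionMs) = timed {
      hours.map { case (d, h) => inputDs.where(hourFilter(d, h)) }.reduce(_ union _)
    }
    val (_, filterMs) = timed {
      inputDs.where(hours.map { case (d, h) => hourFilter(d, h) }.reduce(_ or _))
    }

    println(s"chained unions: $unionMs ms, single filter: $filterMs ms")
    spark.stop()
  }
}
```

If the chained-union number is much larger than the single-filter one, that would be consistent with the time going into plan handling rather than into reading data.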

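Edit: you can also try doing the unions in groups (a batch size of 50 in my case). I ran some tests with dummy in-memory Datasets, and in my case this improved performance by a factor of 8: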

val myDs = Iterator.iterate(fromDate)(_.plus(ofHours(1))).takeWhile(_.isBefore(toDate)).map(next => {
  getDsForOneHour(inputDs, next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour)
})
  .grouped(50).map(dss => dss.reduce(_ union _))
  .reduce(_ union _)
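
A related variant, which is not what I tested above, is to combine the Datasets pairwise in rounds instead of in fixed batches of 50, so that large intermediate results are built fewer times. Here is a minimal sketch of such a helper; the names BalancedReduce and balancedReduce are hypothetical, and it is written generically so it does not depend on Spark itself. Whether it beats grouped(50) on your data is something you would have to measure.

```scala
object BalancedReduce {
  // Combines neighbours pairwise, round after round, until one element is left.
  // A left-to-right reduce nests n - 1 levels deep; this nests about log2(n) levels.
  @annotation.tailrec
  def balancedReduce[A](xs: Seq[A])(combine: (A, A) => A): A = {
    require(xs.nonEmpty, "need at least one element")
    if (xs.size == 1) xs.head
    else balancedReduce(xs.grouped(2).map(_.reduce(combine)).toVector)(combine)
  }
}
```

With the code above, you would drop the grouped(50) and reduce steps, call .toSeq on the mapped iterator, and pass the result as BalancedReduce.balancedReduce(hourlyDatasets)(_ union _), where hourlyDatasets is just a name I picked for that sequence.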

Slow unions on Datasets
