I'm currently discussing with colleagues how caching could be of benefit in the following scenario:
val dataset1 = sparkSession.read.json("...") // very expensive read
val dataset2 = sparkSession.read.json("...") // very expensive read
val joinedDataset = dataset1.join(dataset2)
val reducedDataset = joinedDataset
  .mapPartitions {
    ???
  }
  .groupByKey("key")
  .reduceGroups {
    ???
  }
reducedDataset.write.json("...")
Would it help (and if so, why?) to cache joinedDataset in order to improve the performance of the reduce operation?
That would look like:
val dataset1 = sparkSession.read.json("...") // very expensive read
val dataset2 = sparkSession.read.json("...") // very expensive read
val joinedDataset = dataset1.join(dataset2).cache
val reducedDataset = joinedDataset
  .mapPartitions {
    ???
  }
  .groupByKey("key")
  .reduceGroups {
    ???
  }
reducedDataset.write.json("...")
Answer 0 (score: 0)
You should benchmark it, but it will either have no effect at all or may even degrade performance: the cached data is never reused here, since joinedDataset feeds exactly one downstream action. And even if it were reused, the join itself already acts as a barrier to recomputation, because the shuffle files it produces are kept and would be reused instead of going back to the expensive reads.
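For contrast, here is a minimal sketch of the case where cache does pay off: the same Dataset is consumed by more than one action, so the second action reads the cached blocks instead of recomputing the join. (The paths are elided as in the question, and the filter on "key" is purely hypothetical.)

import org.apache.spark.sql.functions.col

val joined = dataset1.join(dataset2).cache() // marked for caching; materialized lazily
joined.write.json("...")                     // 1st action: computes the join and fills the cache
joined
  .filter(col("key").isNotNull)              // hypothetical second consumer of the same data
  .write.json("...")                         // 2nd action: served from the cache, join not recomputed
joined.unpersist()                           // release the cached blocks when done

In the original pipeline there is only the single write at the end, so the cache would be filled and then never read a second time.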