To cache or not to cache with groupByKey

Date: 2018-02-01 16:04:21

Tags: scala performance apache-spark caching

Currently discussing with a colleague how caching would benefit the following scenario:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2)

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???              // per-partition transformation (elided)
    }
    .groupByKey(???)   // key-extractor function; the typed Dataset API
                       // takes a function here, not a column name
    .reduceGroups {
      ???              // reduce function (elided)
    }

  reducedDataset.write.json("...")

Would it help (and if so, why) to cache joinedDataset in order to improve the performance of the reduce operation?

It would then be:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2).cache

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???              // per-partition transformation (elided)
    }
    .groupByKey(???)   // key-extractor function, as above
    .reduceGroups {
      ???              // reduce function (elided)
    }

  reducedDataset.write.json("...")

1 answer:

Answer 0 (score: 0)

You should benchmark it, but in this pipeline caching will either have no effect at all or will actually degrade performance:

  • No effect at all, because the cached data is never reused: joinedDataset feeds exactly one downstream action (see the sketch after this list). Even if it were reused, the shuffle produced by the join would already act as a barrier against recomputation, since Spark persists shuffle files and skips recomputing upstream stages.
  • Possibly degraded performance, because caching itself is usually expensive: materializing the joined data costs memory (and, depending on the storage level, disk and serialization) on top of the job's actual work.
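
To make the first point concrete, below is a minimal sketch of the one situation where cache does pay off: the same Dataset feeding more than one action. The session setup, file paths, and the join key "id" are hypothetical placeholders, not part of the original question.

  import org.apache.spark.sql.SparkSession

  // Local master only for illustration; in practice this comes from spark-submit.
  val spark = SparkSession.builder
    .appName("cache-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical inputs standing in for the two expensive reads,
  // joined on a hypothetical common column "id".
  val joined = spark.read.json("/tmp/a.json")
    .join(spark.read.json("/tmp/b.json"), "id")
    .cache() // lazy: nothing is materialized yet

  // The first action materializes the cache as a side effect.
  joined.write.json("/tmp/out")

  // A second action over the same data is where .cache pays off: the joined
  // partitions are served from storage instead of being recomputed from the
  // expensive reads and the join.
  println(joined.count())

  joined.unpersist() // release the cached blocks once done

In the question's pipeline there is only the single write action, so this reuse never happens; whether a cache is actually hit can be checked with joinedDataset.explain(), which shows an InMemoryTableScan node in the physical plan when cached data is read.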