Question

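As my Spark application grows, I've noticed it is getting harder and harder to read, because the real logic is mixed in with code that only runs in debug mode.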

val newRDD = doSomething(initialRDD)
if (debugMode) {
    newRDD.cache()
    // take(200) truncates safely; substring(0, 200) throws on rows shorter than 200 characters
    newRDD.foreach { row => logDebug("Rows of newRDD: " + row.toString.take(200)) }
    doMoreStuff()
}
val finalRDD = doSomethingElse(newRDD)
finalRDD.count()
logInfo("Part 1 completed")

What is the best way to clean up this kind of situation?

Answer 1

Here is a trick I use for this, based on the "enrich my library" pattern:

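import scala.language.implicitConversions
import org.apache.spark.rdd.RDD

// "Wrapper" class that adds a printDebugRecords method to RDD.
// It is Serializable because the foreach closure below calls logDebug on this
// instance, so the wrapper is captured by the closure.
class RDDDebugFunction[K](rdd: RDD[K]) extends Serializable {

  // Logs every record when debug mode is on; always returns the RDD so calls can be chained.
  def printDebugRecords(msgFormat: K => String): RDD[K] = {
    if (isDebugMode) {
      rdd.cache()
      rdd.foreach { row => logDebug(msgFormat(row)) }
      doMoreStuff()
    }
    rdd
  }

  // Placeholders: supply the application's own implementations.
  def isDebugMode: Boolean = ???
  def logDebug(s: String): Unit = ???
  def doMoreStuff(): Unit = ???
}

// Implicit conversion from RDD to the wrapper class
object RDDDebugFunction {
  implicit def toDebugFunction[K](rdd: RDD[K]): RDDDebugFunction[K] = new RDDDebugFunction(rdd)
}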

Now, by importing RDDDebugFunction._, we can call the new method:

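val rdd = sc.parallelize(Seq(1, 2, 3, 4))

import RDDDebugFunction._

// take(200) instead of substring(0, 200): substring throws when the string is shorter than 200 characters
rdd.printDebugRecords(row => "Rows of newRDD: " + row.toString.take(200))
rdd.count()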
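
One caveat worth keeping in mind: rdd.foreach runs on the executors, so on a cluster the debug lines end up in the executor logs rather than the driver log. If a bounded sample on the driver is enough, a variant along the following lines may be easier to work with. This is only a sketch, not part of the original answer: DebugSyntax, DebugOps, debugSample, preview, and the debug.mode system property are hypothetical names chosen for illustration, and logging goes through SLF4J directly. It also assumes Scala 2.11 or later, where an implicit class can replace the separate wrapper class and implicit conversion.

```scala
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory

object DebugSyntax {
  private val log = LoggerFactory.getLogger("DebugSyntax")

  // Hypothetical switch: start the driver with -Ddebug.mode=true to enable.
  val debugEnabled: Boolean = sys.props.get("debug.mode").contains("true")

  // take(maxLen) truncates without throwing on strings shorter than maxLen.
  def preview(value: Any, maxLen: Int = 200): String = s"$value".take(maxLen)

  // Implicit class: wrapper and implicit conversion in a single declaration.
  implicit class DebugOps[T](rdd: RDD[T]) {

    /** Logs up to `n` records on the driver when debugging is on; returns the RDD for chaining. */
    def debugSample(label: String, n: Int = 20): RDD[T] = {
      if (debugEnabled) {
        // take(n) brings at most n records to the driver, so the output
        // lands in the driver log and the cost stays bounded on large RDDs.
        rdd.take(n).foreach(row => log.debug(s"$label: ${preview(row)}"))
      }
      rdd
    }
  }
}
```

With import DebugSyntax._ in scope, the pipeline from the question could then read val newRDD = doSomething(initialRDD).debugSample("Rows of newRDD"), which keeps the debug branch out of the main flow.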
