使用Spark Testing Base库创建Spark DataFrame的最佳方法是什么?

时间:2017-12-19 12:36:18

标签: scala unit-testing apache-spark spark-dataframe

我正在为一个Spark方法编写单元测试,该方法将多个数据帧作为输入参数并返回一个数据帧。 spark方法的代码如下所示:

class processor {
    def process(df1: DataFrame, df2: DataFrame): DataFrame = {
      // process and return resulting data frame
    }
}

相应单元测试的现有代码如下:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.{FlatSpec, Matchers}

class TestProcess extends FlatSpec with DataFrameSuiteBase with Matchers {

  val p:Processor = new Processor

  "process()" should "return only one row" in {
    df1RDD = sc.parallelize(
      Seq("a", 12, 98999),
      Seq("b", 42, 99)
    )
   df1DF = spark.createDataFrame(df1RDD).toDF()

    df2RDD = sc.parallelize(
      Seq("X", 12, "foo", "spark"),
      Seq("Z", 42, "bar", "storm")
    )
   df2DF = spark.createDataFrame(df2RDD).toDF()

  val result = p.process(df1, df2)
  }

  it should "return spark row" in {
    df1RDD = sc.parallelize(
      Seq("a", 12, 98999),
      Seq("b", 42, 99)
    )
   df1DF = spark.createDataFrame(df1RDD).toDF()

    df2RDD = sc.parallelize(
      Seq("X", 12, "foo", "spark"),
      Seq("Z", 42, "bar", "storm")
    )
   df2DF = spark.createDataFrame(df2RDD).toDF()

  val result = p.process(df1, df2)
  }
}

此代码工作正常,但是在每个测试方法中都存在创建RDD和DF的代码重复的问题。当我尝试在测试方法之外或在BeforeAndAfterAll()方法内创建RDD时,我收到有关sc不可用的错误。似乎Spark Testing Base库仅在测试方法中启动scspark变量。

我想知道是否有任何方法可以避免编写此重复代码?

使用WordSpec而非使用FlatSpec

后更新了代码
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalamock.scalatest.MockFactory
import org.scalatest.{Matchers, WordSpec}

class TestProcess extends WordSpec with DataFrameSuiteBase with Matchers {

  val p:Processor = new Processor

  "process()" should {
    df1RDD = sc.parallelize(
        Seq("a", 12, 98999),
        Seq("b", 42, 99)
      )
    df1DF = spark.createDataFrame(df1RDD).toDF()

    df2RDD = sc.parallelize(
        Seq("X", 12, "foo", "spark"),
        Seq("Z", 42, "bar", "storm")
    )
    df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1, df2)

    "return only one row" in {             
      result.count should equal(1)
    }

    "return spark row" in {
      // assertions to check if 'row' containing 'spark' in last column is in the result or not
    }
  }
}

1 个答案:

答案 0 :(得分:3)

使用WordSpec代替FlatSpec,因为它允许在测试子句之前对常规初始化进行分组,如

"process()" should {
     df1RDD = sc.parallelize(Seq("a", 12, 98999),Seq("b", 42, 99))
     df1DF = spark.createDataFrame(df1RDD).toDF()
     df2RDD = sc.parallelize(Seq("X", 12, "foo", "spark"), Seq("Z", 42, "bar", "storm"))
     df2DF = spark.createDataFrame(df2RDD).toDF()
     "return only one row" in {
         ....
     }
     "return spark row" in {
         ....
     }
}

编辑:此外,以下两行代码几乎没有理由使用库(spark-testing-base):

val spark = SparkSession.builder.master("local[1]").getOrCreate
val sc = spark.sparkContext

将这些添加到您的课程顶部,并且您使用SparkContext和所有设置,并且没有NPE。

编辑:我刚刚通过自己的测试证实,火花测试基础 与WordSpec不兼容。如果您仍想使用它,请考虑与库作者一起打开错误报告,因为这肯定是spark-testing-base的一个问题。