How do I write integration tests for Spark's new Structured Streaming?

Asked: 2018-03-27 19:50:51

Tags: apache-spark integration-testing scalatest

Trying to test Spark Structured Streams... and failing... How do I test them properly?

I followed the general Spark testing advice from here, and my closest attempt [1] looks like this:

import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec  
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode

class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {

  describe("Structured Streaming") {

    it("Read file from system") {

      val schema = new StructType()
        .add("station_id", IntegerType)
        .add("name", StringType)
        .add("lat", DoubleType)
        .add("long", DoubleType)
        .add("dockcount", IntegerType)
        .add("landmark", StringType)
        .add("installation", DateType)

      val sourceDF = spark.readStream
        .option("header", "true")
        .schema(schema)
        .csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
        .coalesce(1)

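      // NOTE: count() is an action on a streaming DataFrame, which is what
      // triggers the AnalysisException quoted below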
      val countSource = sourceDF.count()

      val query = sourceDF.writeStream
        .format("memory")
        .queryName("Output")
        .outputMode(OutputMode.Append())
        .start()
        .processAllAvailable()

      assert(countSource === 70)
    }

  }

}

Sadly, it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start().

I also found this issue on the spark-testing-base repo and wonder whether it is even possible to test Spark Structured Streaming at all.

I would like to have integration tests, maybe even using Kafka to test checkpointing or specific corrupt-data scenarios. Can someone help me out?

Last but not least, I figure the version might also be a constraint: I am currently developing against 2.1.0, which I need because of the Azure HDInsight deployment options. If that turns out to be the blocker, self-hosting is an option.

2 answers:

Answer 0 (score: 2)

Did you get this solved?

You are performing a count() on the streaming DataFrame before execution has been started by calling start(). If you want a count, how about doing this instead:

  sourceDF.writeStream
    .format("memory")
    .queryName("Output")
    .outputMode(OutputMode.Append())
    .start()
    .processAllAvailable()

  // collect() returns an Array[Row]; the original collectAsList() returns a
  // java.util.List[Row], which would not compile against a Scala List[Row]
  val results = spark.sql("select * from Output").collect()
  assert(results.length === 70)
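
If you want the test input to be fully deterministic instead of reading from the file system, you can also drive the query from a MemoryStream. Below is a minimal sketch, assuming an active SparkSession named spark inside a test body; the query name Doubled and the doubling logic are made up for illustration:

  import org.apache.spark.sql.execution.streaming.MemoryStream

  import spark.implicits._
  implicit val sqlCtx = spark.sqlContext

  // MemoryStream lets the test push data explicitly and then wait for it
  // to be processed, so there are no timing races
  val events = MemoryStream[Int]
  val query = events.toDS()
    .map(_ * 2)
    .writeStream
    .format("memory")
    .queryName("Doubled")
    .outputMode(OutputMode.Append())
    .start()

  events.addData(1, 2, 3)
  query.processAllAvailable()

  val out = spark.sql("select * from Doubled").collect().map(_.getInt(0)).sorted.toList
  assert(out === List(2, 4, 6))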

Answer 1 (score: 1)

You can also use the StructuredStreamingBase trait from @holdenk's testing library: https://github.com/holdenk/spark-testing-base/blob/936c34b6d5530eb664e7a9f447ed640542398d7e/core/src/test/2.2/scala/com/holdenkarau/spark/testing/StructuredStreamingSampleTests.scala

Here is an example of how to use it:

import com.holdenkarau.spark.testing.{SharedSparkContext, StructuredStreamingBase}
import org.apache.spark.sql.Dataset
import org.scalatest.FunSuite

class StructuredStreamingTests extends FunSuite with SharedSparkContext with StructuredStreamingBase {

  override implicit def reuseContextIfPossible: Boolean = true

  test("add 3") {
    import spark.implicits._
    val input = List(List(1), List(2, 3))
    val expected = List(4, 5, 6)
    def compute(input: Dataset[Int]): Dataset[Int] = {
      input.map(elem => elem + 3)
    }
    testSimpleStreamEndState(spark, input, expected, "append", compute)
  }
}
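
Note that StructuredStreamingBase lives under the 2.2 source tree of spark-testing-base, so this route needs at least Spark 2.2 rather than the 2.1.0 you are targeting. A typical sbt coordinate would look like the line below (the exact version string is an assumption; check the repo's README for the current release):

  libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"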