How do I write integration tests for Spark's new Structured Streaming?

Asked: 2018-03-27 19:50:51

Tags: apache-spark integration-testing scalatest

Trying to test Spark Structured Streams... and failing... How do I test them properly?

I followed the general Spark testing advice from here, and my closest attempt [1] looks like this:

import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec  
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode

class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {

  describe("Structured Streaming") {

    it("Read file from system") {

      val schema = new StructType()
        .add("station_id", IntegerType)
        .add("name", StringType)
        .add("lat", DoubleType)
        .add("long", DoubleType)
        .add("dockcount", IntegerType)
        .add("landmark", StringType)
        .add("installation", DateType)

      val sourceDF = spark.readStream
        .option("header", "true")
        .schema(schema)
        .csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
        .coalesce(1)

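      // NOTE: count() is an action on a streaming DataFrame, which is what
      // triggers the AnalysisException quoted below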
      val countSource = sourceDF.count()

      val query = sourceDF.writeStream
        .format("memory")
        .queryName("Output")
        .outputMode(OutputMode.Append())
        .start()
        .processAllAvailable()

      assert(countSource === 70)
    }

  }

}

Sadly, it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start().

I also found this issue on the spark-testing-base repo and wonder whether it is even possible to test Spark Structured Streaming at all.

I would like to have integration tests, maybe even using Kafka to test checkpointing or specific corrupt-data scenarios. Can someone help me out?

Last but not least, I figure the version might also be a constraint: I am currently developing against 2.1.0, which I need because of the Azure HDInsight deployment options. If that turns out to be the blocker, self-hosting is an option.

2 answers:

Answer 0 (score: 2)

Did you get this solved?

You are performing a count() on the streaming DataFrame before execution has been started by calling start(). If you want a count, how about doing this instead:

  sourceDF.writeStream
    .format("memory")
    .queryName("Output")
    .outputMode(OutputMode.Append())
    .start()
    .processAllAvailable()

  // collect() returns an Array[Row]; the original collectAsList() returns a
  // java.util.List[Row], which would not compile against a Scala List[Row]
  val results = spark.sql("select * from Output").collect()
  assert(results.length === 70)
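
If you want the test input to be fully deterministic instead of reading from the file system, you can also drive the query from a MemoryStream. Below is a minimal sketch, assuming an active SparkSession named spark inside a test body; the query name Doubled and the doubling logic are made up for illustration:

  import org.apache.spark.sql.execution.streaming.MemoryStream

  import spark.implicits._
  implicit val sqlCtx = spark.sqlContext

  // MemoryStream lets the test push data explicitly and then wait for it
  // to be processed, so there are no timing races
  val events = MemoryStream[Int]
  val query = events.toDS()
    .map(_ * 2)
    .writeStream
    .format("memory")
    .queryName("Doubled")
    .outputMode(OutputMode.Append())
    .start()

  events.addData(1, 2, 3)
  query.processAllAvailable()

  val out = spark.sql("select * from Doubled").collect().map(_.getInt(0)).sorted.toList
  assert(out === List(2, 4, 6))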

Answer 1 (score: 1)

You can also use the StructuredStreamingBase trait from @holdenk's testing library: https://github.com/holdenk/spark-testing-base/blob/936c34b6d5530eb664e7a9f447ed640542398d7e/core/src/test/2.2/scala/com/holdenkarau/spark/testing/StructuredStreamingSampleTests.scala

Here is an example of how to use it:

import com.holdenkarau.spark.testing.{SharedSparkContext, StructuredStreamingBase}
import org.apache.spark.sql.Dataset
import org.scalatest.FunSuite

class StructuredStreamingTests extends FunSuite with SharedSparkContext with StructuredStreamingBase {

  override implicit def reuseContextIfPossible: Boolean = true

  test("add 3") {
    import spark.implicits._
    val input = List(List(1), List(2, 3))
    val expected = List(4, 5, 6)
    def compute(input: Dataset[Int]): Dataset[Int] = {
      input.map(elem => elem + 3)
    }
    testSimpleStreamEndState(spark, input, expected, "append", compute)
  }
}
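
Note that StructuredStreamingBase lives under the 2.2 source tree of spark-testing-base, so this route needs at least Spark 2.2 rather than the 2.1.0 you are targeting. A typical sbt coordinate would look like the line below (the exact version string is an assumption; check the repo's README for the current release):

  libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"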