Trying to test Spark Structured Streams... and failing... How do I test them properly?

I followed the general Spark testing question from here, and my closest attempt [1] looks like this:
import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode

class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {

  describe("Structured Streaming") {

    it("Read file from system") {
      val schema = new StructType()
        .add("station_id", IntegerType)
        .add("name", StringType)
        .add("lat", DoubleType)
        .add("long", DoubleType)
        .add("dockcount", IntegerType)
        .add("landmark", StringType)
        .add("installation", DateType)

      val sourceDF = spark.readStream
        .option("header", "true")
        .schema(schema)
        .csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
        .coalesce(1)

      val countSource = sourceDF.count()

      val query = sourceDF.writeStream
        .format("memory")
        .queryName("Output")
        .outputMode(OutputMode.Append())
        .start()
        .processAllAvailable()

      assert(countSource === 70)
    }
  }
}
Sadly, it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start().

I also found this issue on the spark-testing-base repo and wondered whether it is possible to test Spark Structured Streaming at all.

I would really like to have integration tests, possibly even using Kafka, to test checkpointing or specific corrupt-data scenarios. Can someone help me?

Last but not least, I think the version may also be a constraint - I am currently developing against 2.1.0, which I need because of the Azure HDInsight deployment options. Self-hosting would be an option if that is the blocker.
Answer 0 (score: 2)

Did you solve this?

You are performing count() on the streaming DataFrame before starting its execution by calling start(). If you want a count, how about doing it like this:
sourceDF.writeStream
  .format("memory")
  .queryName("Output")
  .outputMode(OutputMode.Append())
  .start()
  .processAllAvailable()

// collectAsList() returns a java.util.List, not a Scala List
val results: java.util.List[Row] = spark.sql("select * from Output").collectAsList()
assert(results.size() === 70)
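If you would rather drive the stream from in-test data instead of a file on disk, a common pattern is to push micro-batches by hand through MemoryStream. Here is a minimal sketch, assuming Spark 2.x; note that MemoryStream sits in an internal Spark package, and the sink table name "Mapped" is just an example:

import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

// MemoryStream lets the test add input batches explicitly.
val events = MemoryStream[Int]
val query = events.toDS()
  .map(_ + 3)
  .writeStream
  .format("memory")
  .queryName("Mapped") // example sink table name
  .outputMode("append")
  .start()

events.addData(1, 2, 3)     // one micro-batch of test input
query.processAllAvailable() // block until that batch has been processed

val result = spark.sql("select * from Mapped").as[Int].collect().sorted
assert(result.sameElements(Array(4, 5, 6)))
query.stop()

processAllAvailable() blocks until everything added so far has been consumed, which is what makes the final assertion deterministic.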
Answer 1 (score: 1)

You can also use the StructuredStreamingBase trait from @holdenk's spark-testing-base library: https://github.com/holdenk/spark-testing-base/blob/936c34b6d5530eb664e7a9f447ed640542398d7e/core/src/test/2.2/scala/com/holdenkarau/spark/testing/StructuredStreamingSampleTests.scala

Here is an example of how to use it:
import org.apache.spark.sql.Dataset
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.{SharedSparkContext, StructuredStreamingBase}

class StructuredStreamingTests extends FunSuite with SharedSparkContext with StructuredStreamingBase {

  override implicit def reuseContextIfPossible: Boolean = true

  test("add 3") {
    import spark.implicits._
    val input = List(List(1), List(2, 3))
    val expected = List(4, 5, 6)
    def compute(input: Dataset[Int]): Dataset[Int] = {
      input.map(elem => elem + 3)
    }
    testSimpleStreamEndState(spark, input, expected, "append", compute)
  }
}
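For what it's worth: judging from the linked sample, testSimpleStreamEndState feeds each inner List as a separate micro-batch through a MemoryStream, applies your compute function, and compares the collected end state against expected. Also note that the sample lives under the 2.2 test tree of the repo, so the structured streaming helpers may require Spark 2.2+, which could matter given the 2.1.0 constraint mentioned in the question.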