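I'm very new to the Scala Spark ecosystem and am wondering what the best way is to unit test a chained DataFrame transformation. Here is a code sample of the method I want to test: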
def writeToParquet(spark: SparkSession, dataFrame: DataFrame, col1: DataType1, col2: DataType2): Unit = {
  dataFrame
    .withColumn("date", some_columnar_date_logic)
    .withColumn("hour", some_more_functional_logic)
    .... // couple more transformation logic
    .write
    .mode(SaveMode.Append)
    .partitionBy("col1", "col2", "col3")
    .parquet("some hdfs/s3/url")
}
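The problem is that parquet has a Unit return type, which makes testing difficult.
A further problem is that transformations are immutable by nature, which makes mocking and spying a bit difficult.
To create the DataFrame, I have dumped a test dataset into a CSV.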
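Answer 0 (score: 7)
Please find below a simple example of unit testing a DataFrame. You can split the work into two parts: first, test the transformation; second, run a simple shell script to check the written files.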
import com.holdenkarau.spark.testing._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.scalatest.{FunSuite, Matchers}

class SomeDFTest extends FunSuite with Matchers with DataFrameSuiteBase {
  import spark.implicits._

  test("Testing Input customer data date transformation") {
    val inputSchema = List(
      StructField("number", IntegerType, false),
      StructField("word", StringType, false)
    )
    val expectedSchema = List(
      StructField("number", IntegerType, false),
      StructField("word", StringType, false),
      StructField("dummyColumn", StringType, false)
    )
    val inputData = Seq(
      Row(8, "bat"),
      Row(64, "mouse"),
      Row(-27, "horse")
    )
    val expectedData = Seq(
      Row(8, "bat", "test"),
      Row(64, "mouse", "test"),
      Row(-27, "horse", "test")
    )
    val inputDF = spark.createDataFrame(
      spark.sparkContext.parallelize(inputData),
      StructType(inputSchema)
    )
    val expectedDF = spark.createDataFrame(
      spark.sparkContext.parallelize(expectedData),
      StructType(expectedSchema)
    )
    val actual = transformSomeDf(inputDF)
    assertDataFrameEquals(actual, expectedDF) // equal
  }

  // The transformation under test: adds a constant string column
  def transformSomeDf(df: DataFrame): DataFrame = {
    df.withColumn("dummyColumn", lit("test"))
  }
}
build.sbt configuration
name := "SparkTest"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
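// spark-testing-base versions are prefixed with the Spark version they target; the 2.4.0 prefix here does not match sparkVersion above
"com.holdenkarau" %% "spark-testing-base" % "2.4.0_0.11.0" % Test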
)
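The second part (checking the written files) can also be done from Scala rather than a shell script. The following is only a minimal sketch under a few assumptions that are not in the answer: the output path is passed in as a parameter instead of being hard-coded, the job is run against a temporary local directory, and the names `ParquetRoundTripCheck` and `outputPath` are made up for illustration. It writes the data, reads it back, and asserts on what landed on disk.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

object ParquetRoundTripCheck {

  // Hypothetical stand-in for the method under test. The output path is a
  // parameter (an assumption) so that a test can point it at a temp directory.
  def writeToParquet(df: DataFrame, outputPath: String): Unit =
    df.withColumn("dummyColumn", lit("test"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("word")
      .parquet(outputPath)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parquet-round-trip")
      .getOrCreate()
    import spark.implicits._

    try {
      // Fresh, empty directory so earlier runs cannot affect the result
      val outputDir = Files.createTempDirectory("parquet-test").toString

      val input = Seq((8, "bat"), (64, "mouse"), (-27, "horse")).toDF("number", "word")
      writeToParquet(input, outputDir)

      // Read the files back and assert on their content
      val written = spark.read.parquet(outputDir)
      assert(written.count() == 3)
      assert(written.columns.toSet == Set("number", "word", "dummyColumn"))
      assert(written.filter($"dummyColumn" =!= "test").count() == 0)
    } finally {
      spark.stop()
    }
  }
}
```

Making the path a parameter is the main design choice here: it is what lets the same write logic run against a throwaway directory in a test and against HDFS or S3 in production.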
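Answer 1 (score: 0)
The first thing I figured out while testing DataFrames is to separate transformations from IO.
For the scenario above, we can split the chain into three parts: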
class Coordinator {
  def transformAndWrite(dataFrame: DataFrame): Unit = {
    // Part 1: transformation
    val transformedDf = dataFrame
      .withColumn("date", some_columnar_date_logic)
      .withColumn("hour", some_more_functional_logic)
      .... // couple more transformation logic

    // Part 2: partitioning
    val partitionedDfWriter = transformedDf.write
      .mode(SaveMode.Append)
      .partitionBy("col1", "col2", "col3")

    // Part 3: writing
    partitionedDfWriter.parquet("some hdfs/s3/url")
  }
}
Now we can move them into three separate classes, DFTransformer, DFPartitioner and DataFrameParquetWriter extends ResourceWriter, so the code becomes something like this:
class DFTransformer {
  def transform(dataFrame: DataFrame): DataFrame = {
    dataFrame
      .withColumn("date", some_columnar_date_logic)
      .withColumn("hour", some_more_functional_logic)
      .... // couple more transformation logic
  }
}

class DFPartitioner {
  def partition(dataFrame: DataFrame): DataFrameWriter[Row] = {
    dataFrame.write
      .mode(SaveMode.Append)
      .partitionBy("col1", "col2", "col3")
  }
}

and

// ResourceWriter is not shown in this answer; it is assumed to be a trait declaring write(...)
class DataFrameParquetWriter extends ResourceWriter {
  override def write(partitionedDfWriter: DataFrameWriter[Row]): Unit = {
    partitionedDfWriter.parquet("some hdfs/s3/url")
  }
}

class Coordinator(dfTransformer: DFTransformer, dfPartitioner: DFPartitioner, resourceWriter: ResourceWriter) {
  def transformAndWrite(dataFrame: DataFrame): Unit = {
    val transformedDf = dfTransformer.transform(dataFrame)
    val partitionedDfWriter = dfPartitioner.partition(transformedDf)
    resourceWriter.write(partitionedDfWriter)
  }
}
The advantage of the above is that when you have to test the Coordinator class, you can very easily use Mockito to mock the dependencies.
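To illustrate how the Coordinator can be tested in isolation, here is a rough sketch. It deliberately uses hand-written test doubles instead of Mockito, mainly because Spark's `DataFrameWriter` is a final class and mocking it needs extra Mockito configuration; with Mockito the same checks would be expressed with `when`/`verify`. The `ResourceWriter` trait shape, the pass-through bodies of `DFTransformer` and `DFPartitioner`, and the name `CoordinatorCheck` are assumptions made for the example, since the answer does not show them.

```scala
import org.apache.spark.sql.{DataFrame, DataFrameWriter, Row, SparkSession}

// Simplified collaborators; the real transformation logic is elided in the answer
class DFTransformer {
  def transform(df: DataFrame): DataFrame = df
}

class DFPartitioner {
  def partition(df: DataFrame): DataFrameWriter[Row] = df.write
}

// Assumed shape of the trait that the answer only mentions by name
trait ResourceWriter {
  def write(writer: DataFrameWriter[Row]): Unit
}

class Coordinator(transformer: DFTransformer, partitioner: DFPartitioner, resourceWriter: ResourceWriter) {
  def transformAndWrite(df: DataFrame): Unit =
    resourceWriter.write(partitioner.partition(transformer.transform(df)))
}

object CoordinatorCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("coordinator-check")
      .getOrCreate()

    try {
      // Distinct objects, so that identity checks show what was passed where
      val input = spark.emptyDataFrame
      val transformed = spark.range(1).toDF("id")
      val preparedWriter = transformed.write // building a writer performs no IO

      var seenByPartitioner: Option[DataFrame] = None
      var seenByWriter: Option[DataFrameWriter[Row]] = None

      // Test doubles: return canned values and record their arguments
      val transformer = new DFTransformer {
        override def transform(df: DataFrame): DataFrame = {
          assert(df eq input)
          transformed
        }
      }
      val partitioner = new DFPartitioner {
        override def partition(df: DataFrame): DataFrameWriter[Row] = {
          seenByPartitioner = Some(df)
          preparedWriter
        }
      }
      val writer = new ResourceWriter {
        override def write(w: DataFrameWriter[Row]): Unit = seenByWriter = Some(w)
      }

      new Coordinator(transformer, partitioner, writer).transformAndWrite(input)

      // The coordinator should only wire the three steps together
      assert(seenByPartitioner.exists(_ eq transformed))
      assert(seenByWriter.exists(_ eq preparedWriter))
    } finally {
      spark.stop()
    }
  }
}
```

Nothing is written to disk here: the fake `ResourceWriter` only records the writer it receives, so the test covers the wiring and nothing else.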
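Testing DFTransformer is now easy too: you can pass in a stubbed DataFrame and assert on the returned DataFrame (using spark-testing-base). We can also test the columns returned by the transformation, as well as the row count.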
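The column and count checks mentioned above could look roughly like the sketch below. Since the answer elides the real date and hour expressions, the `to_date`/`hour` logic and the `event_time` input column are placeholders chosen purely for illustration, and `DFTransformerCheck` is a made-up name. Plain assertions on a local SparkSession are used here; `assertDataFrameEquals` from spark-testing-base would serve for a full comparison against an expected DataFrame.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hour, to_date}

class DFTransformer {
  // Placeholder logic (an assumption): derive "date" and "hour" from a timestamp column
  def transform(df: DataFrame): DataFrame =
    df.withColumn("date", to_date(col("event_time")))
      .withColumn("hour", hour(col("event_time")))
}

object DFTransformerCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("transformer-check")
      .getOrCreate()
    import spark.implicits._

    try {
      // Small stub input built in memory, no CSV needed
      val input = Seq(
        (1, Timestamp.valueOf("2020-01-02 03:04:05")),
        (2, Timestamp.valueOf("2020-01-02 23:59:59"))
      ).toDF("id", "event_time")

      val actual = new DFTransformer().transform(input)

      // Columns returned by the transformation
      assert(actual.columns.toSeq == Seq("id", "event_time", "date", "hour"))

      // Row count is unchanged by withColumn
      assert(actual.count() == input.count())

      // Spot-check the derived values, ordered by id to keep the check deterministic
      val hours = actual.orderBy("id").select("hour").as[Int].collect().toSeq
      assert(hours == Seq(3, 23))
    } finally {
      spark.stop()
    }
  }
}
```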