Spark unit testing with DataFrames: collect returns empty arrays

Date: 2015-08-14 07:40:17

Tags: scala unit-testing apache-spark-sql

I am using Spark and I have been struggling to write a simple unit test with DataFrames and Spark SQL.

Here is the code snippet:

class TestDFSpec extends SharedSparkContext  { 
  "Test DF " should { 
    "pass equality" in { 
      val createDF = sqlCtx.createDataFrame(createsRDD,classOf[Test]).toDF() 
      createDF.registerTempTable("test") 

      sqlCtx.sql("select * FROM test").collectAsList() === List(Row(Test.from(create1)),Row(Test.from(create2))) 
    } 
  } 
  val create1 = "4869215,bbbbb" 
  val create2 = "4869215,aaaaa" 
  val createsRDD = sparkContext.parallelize(Seq(create1,create2)).map(Test.from) 
}

I copied the code from the Spark GitHub repository and added a few small changes to provide a SQLContext:

trait SharedSparkContext extends Specification with BeforeAfterAll { 
  import net.lizeo.bi.spark.conf.JobConfiguration._ 

  @transient private var _sql: SQLContext = _ 

  def sqlCtx: SQLContext = _sql 

  override def beforeAll() { 

    println(sparkConf) 

    _sql = new SQLContext(sparkContext) 

  } 

  override def afterAll() { 
    sparkContext.stop() 
    _sql =  null 

  } 
} 

The Test model is very simple:

case class Test(key: Int, value: String)

object Test {
  def from(line: String): Test = {
    val f = line.split(",")
    Test(f(0).toInt, f(1))
  }
}

The job configuration object:

object JobConfiguration {
  val conf = ConfigFactory.load()

  val sparkName = conf.getString("spark.name")
  val sparkMaster = conf.getString("spark.master")

  lazy val sparkConf = new SparkConf()
    .setAppName(sparkName)
    .setMaster(sparkMaster)
    .set("spark.executor.memory",conf.getString("spark.executor.memory"))         
    .set("spark.io.compression.codec",conf.getString("spark.io.compression.codec"))

  val sparkContext = new SparkContext(sparkConf)  
}
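
For completeness, an illustrative application.conf matching the keys read above; the values below are placeholders, not the original configuration:

# Hypothetical Typesafe Config file (e.g. src/test/resources/application.conf)
spark {
  name = "test-df-spec"
  master = "local[2]"
  executor.memory = "1g"
  io.compression.codec = "snappy"
}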

I am using Spark 1.3.0 with specs2. The exact dependencies in my sbt project file are:

object Dependencies { 
  private val sparkVersion = "1.3.0" 
  private val clouderaVersion = "5.4.4" 

  private val sparkClouderaVersion = s"$sparkVersion-cdh$clouderaVersion" 

  val sparkCdhDependencies = Seq( 
    "org.apache.spark" %% "spark-core" % sparkClouderaVersion % "provided", 
    "org.apache.spark" %% "spark-sql" % sparkClouderaVersion % "provided" 
    ) 

} 

The test output is:

[info] TestDFSpec  
[info]  
[info] Test DF  should  
[error]   x pass equality  
[error]    '[[], []]'  
[error]  
[error]     is not equal to  
[error]  
[error]    List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)]) (TestDFSpec.scala:17)  
[error] Actual:   [[], []]
[error] Expected: List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)])

sqlCtx.sql("select * FROM test").collectAsList() returns [[], []]

Any help would be greatly appreciated. I have no problem testing with RDDs, but I really want to move from RDDs to DataFrames and be able to use Parquet directly from Spark to store files.
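
For the Parquet part, a minimal sketch of writing a DataFrame to Parquet and reading it back with the Spark 1.3 API (the path and temp table name are placeholders):

// Write the DataFrame to Parquet and read it back (Spark 1.3 API; path is a placeholder)
createDF.saveAsParquetFile("/tmp/test_parquet")
val fromParquet = sqlCtx.parquetFile("/tmp/test_parquet")
fromParquet.registerTempTable("test_parquet")
sqlCtx.sql("select key, value from test_parquet").show()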

Thanks in advance

1 Answer:

Answer 0 (score: 2):

The test passes with the following code:

class TestDFSpec extends SharedSparkContext  {
  import sqlCtx.implicits._
  "Test DF " should {
    "pass equality" in {
      val createDF = sqlCtx.createDataFrame(Seq(create1,create2).map(Test.from))
      createDF.registerTempTable("test")
      val result = sqlCtx.sql("select * FROM test").collect()
      result === Array(Test.from(create1),Test.from(create2)).map(Row.fromTuple)
    }
  }

  val create1 = "4869215,bbbbb"
  val create2 = "4869215,aaaaa"
}

The main difference lies in how the DataFrame is created: from a Seq[Test] instead of an RDD[Test].
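
The empty rows in the original test most likely come from the createDataFrame(rdd, classOf[Test]) overload, which infers the schema through JavaBean introspection; a Scala case class exposes key and value rather than getKey/getValue, so no columns are found. Under that assumption, a sketch (untested) of an RDD-based variant that should also pass:

// Use the Product/case-class overload of createDataFrame, which derives the
// schema from the fields of Test instead of from JavaBean getters
val createsRDD = sparkContext.parallelize(Seq(create1, create2)).map(Test.from)
val createDF = sqlCtx.createDataFrame(createsRDD)
createDF.registerTempTable("test")
val result = sqlCtx.sql("select * FROM test").collect()
result === Array(Test.from(create1), Test.from(create2)).map(Row.fromTuple)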

I asked for an explanation on the Spark user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-dataframe-td24240.html#none