Spark MongoDB connector unit tests

Date: 2016-10-06 22:37:16

Tags: mongodb apache-spark spark-streaming

I am trying to set up the Spark-MongoDB connector in my test framework. My StreamingContext is set up like this:

val conf = new SparkConf()
          .setMaster("local[*]")
          .setAppName("test")
          .set("spark.mongodb.input.uri", "mongodb://localhost:27017/testdb.testread")
          .set("spark.mongodb.output.uri", "mongodb://localhost:27017/testdb.testwrite")

lazy val ssc = new StreamingContext(conf, Seconds(1))

Whenever I try to set up a DStream like this:

val records = new ConstantInputDStream(ssc, ssc.sparkContext.makeRDD(seq))

I run into this error:


java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

It looks like the context starts and then stops immediately, but I can't figure out why. The logs don't show any errors. This is where it finishes starting and then immediately stops:


[DEBUG] 2016-10-06 18:29:51,625 org.spark_project.jetty.util.component.AbstractLifeCycle setStarted - STARTED @4858ms o.s.j.s.ServletContextHandler@33b85bc{/metrics/json,null,AVAILABLE}
[WARN] 2016-10-06 18:29:51,660 org.apache.spark.streaming.StreamingContext logWarning - StreamingContext has not been started yet
[DEBUG] 2016-10-06 18:29:51,662 org.spark_project.jetty.util.component.AbstractLifeCycle setStopping - stopping org.spark_project.jetty.server.Server@2139a5fc
[DEBUG] 2016-10-06 18:29:51,664 org.spark_project.jetty.server.Server doStop - Graceful shutdown org.spark_project.jetty.server.Server@2139a5fc

When I remove the MongoDB connection settings, it doesn't shut down and everything works fine (except that I can't read from or write to Mongo :()

EDIT: Here is the test where I try to write to Mongo. However, my test suite fails before I even get to this point.

"read from kafka queue" in new SparkScope{

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](List("topic"),
      Map[String, Object](
        "bootstrap.servers"->s"localhost:${kServer.kafkaPort}",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "testing",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
    )
  )
  val writeConfig = WriteConfig(Map(
    "collection"->"testcollection",
    "writeConcern.w"->"majority",
    "db"->"testdb"
  ), Some(WriteConfig(ssc.sparkContext)))

  stream.map(r => (r.key.toLong, r.value.toLong))
    .reduceByKey(_+_)
    .map{case (k,v) => {
      val d = new Document()
      d.put("key", k)
      d.put("value", v)
      d
    }}
    .foreachRDD(rdd => rdd.saveToMongoDB(writeConfig))

  ssc.start
  (1 until 10).foreach(x => producer.send(KafkaProducerRecord("topic", "1", "1")))
  ssc.awaitTerminationOrTimeout(1500)
  ok
}

The failure happens when I try to create the stream from a Scala collection:

"return a single record with the correct sum" in new SparkScope{
    val stream = new ConstantInputDStream(ssc, ssc.sparkContext.makeRDD(seq))
    val m = HashMap.empty[Long,Long]
    FlattenTimeSeries.flatten(stream).foreachRDD(rdd => m ++= rdd.collect())
    ssc.start()
    ssc.awaitTerminationOrTimeout(1500)
    m.size === 1 and m(1) === 20
  }

The SparkScope class just creates the StreamingContext shown above and calls ssc.stop() after each test.
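
A minimal sketch of such a SparkScope, assuming specs2's Scope and After traits (the exact base traits aren't shown here); only the conf/ssc definitions and the ssc.stop() call come from the description above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.specs2.mutable.After
import org.specs2.specification.Scope

trait SparkScope extends Scope with After {
  // Eager conf next to a lazy context: the combination described above.
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("test")
    .set("spark.mongodb.input.uri", "mongodb://localhost:27017/testdb.testread")
    .set("spark.mongodb.output.uri", "mongodb://localhost:27017/testdb.testwrite")

  lazy val ssc = new StreamingContext(conf, Seconds(1))

  // Tear the context down after each example.
  def after: Any = ssc.stop()
}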

1 Answer:

Answer 0 (score: 1):

Figured it out. The problem was that the SparkConf variable was not declared lazy, but the StreamingContext was. I'm not sure why that matters, but it does. Fixed.
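
A minimal sketch of the fix as described, assuming the change was to add lazy to the conf (rather than dropping it from ssc); the rest simply repeats the setup from the question:

// Both values lazy, so the conf is only evaluated when the lazy
// StreamingContext first needs it. Why the eager/lazy mix broke the
// MongoDB settings is not explained in the answer above.
lazy val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("test")
  .set("spark.mongodb.input.uri", "mongodb://localhost:27017/testdb.testread")
  .set("spark.mongodb.output.uri", "mongodb://localhost:27017/testdb.testwrite")

lazy val ssc = new StreamingContext(conf, Seconds(1))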