How to combine the datasets from each Spark input stream into one

Time: 2016-08-08 04:38:05

Tags: scala apache-spark spark-streaming

How can I combine the datasets from each Spark input stream into one before applying transformations? I am using spark-2.0.0.

val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // needed for the $"columnName" syntax used below
val lines = ssc.textFileStream("input")

lines.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    val dataSet = sqlContext.read.json(rdd)
    val accountIds = dataSet.select("accountId").distinct.collect.flatMap(_.toSeq)
    val accountIdArry = accountIds.map(accountId => dataSet.where($"accountId" <=> accountId))
    accountIdArry.foreach { arrEle =>
      print(arrEle.count)
      arrEle.show
      arrEle.write.format("json").save("output")
    }
  }
}

I want to write to the output file only those records whose count per accountId, taken across all input streams, is greater than 100000. To do that, I would like to merge all the DStreams into one before performing the transformations.

Right now it writes all records to the output file. Any help?
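For reference, if the intent is literally to merge several input DStreams into a single stream before any transformation, one option is StreamingContext.union. The sketch below is only an illustration and assumes a hypothetical second input directory ("input2"); it reuses the ssc from the snippet above.

// Hypothetical sketch: merge two file streams into one DStream before
// transforming. "input" and "input2" are placeholder directories.
val streamA = ssc.textFileStream("input")
val streamB = ssc.textFileStream("input2")
val merged = ssc.union(Seq(streamA, streamB)) // a single DStream[String] containing data from both inputs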

Update

org.apache.spark.SparkException: Task not serializable

  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$3.apply(PairDStreamFunctions.scala:433)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$3.apply(PairDStreamFunctions.scala:432)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:682)
  at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions.updateStateByKey(PairDStreamFunctions.scala:432)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$1.apply(PairDStreamFunctions.scala:400)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$1.apply(PairDStreamFunctions.scala:400)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:682)
  at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
  at org.apache.spark.streaming.dstream.PairDStreamFunctions.updateStateByKey(PairDStreamFunctions.scala:399)
  at SparkExample$.main(:60)
  ... 56 more
Caused by: java.io.NotSerializableException: SparkExample$
Serialization stack:
  - object not serializable (class: SparkExample$, value: SparkExample$@ab3b54)
  - field (class: SparkExample$$anonfun$5, name: $outer, type: class SparkExample$)
  - object (class SparkExample$$anonfun$5, )
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
  ... 74 more

SparkExample.scala

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import play.api.libs.json._
import org.apache.spark.sql._
import org.apache.spark.streaming.dstream._


object SparkExample {
    def main(inputDir: String) {
        val ssc = new StreamingContext(sc, Seconds(2))
        val sqlContext = new SQLContext(sc)


        val lines: DStream[String] = ssc.textFileStream(inputDir)

        val jsonLines = lines.map[JsValue](l => Json.parse(l))

        val accountIdLines = jsonLines.map[(String, JsValue)](json => {
            val accountId = (json \ "accountId").as[String]
            (accountId, json)
        })

        val accountIdCounts = accountIdLines
            .map[(String, Long)]({ case (accountId, json) => {
            (accountId, 1)
        } })
        .reduceByKey((a, b) => a + b)


        // this DStream[(String, Long)] will have current accumulated count for accountId's
        val updatedAccountCounts = accountIdCounts
        .updateStateByKey(updatedCountOfAccounts _)
    }

    def updatedCountOfAccounts(a: Seq[Long], b: Option[Long]): Option[Long] = {
        b.map(i => i + a.sum).orElse(Some(a.sum))
    }
}
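For context on the trace above: the eta-expanded method reference `updatedCountOfAccounts _` is compiled into an anonymous function whose `$outer` field points at the enclosing `SparkExample` object, which is not serializable, so the closure cleaner rejects it. A common workaround, shown here only as a hedged sketch (assuming the rest of the program stays the same), is to pass a plain function value instead:

// Hedged sketch: a function value held in a val does not reference the
// enclosing object, so the closure serialized by updateStateByKey stays
// free of the non-serializable SparkExample$ reference.
val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
  (a, b) => b.map(_ + a.sum).orElse(Some(a.sum))

val updatedAccountCounts = accountIdCounts.updateStateByKey(updateFunc)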

1 Answer:

Answer 0: (score: 1)

There are two things you need to keep in mind.

First: since you are using a StreamingContext with 2-second micro-batches, your DStreams will contain RDDs holding only the data generated during those 2 seconds, not all of the data. If you need to operate on all of the data available at that point, then streaming is not the right fit for your problem.
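(If a bounded time range would be enough rather than literally all data, windowed aggregation is one option. The sketch below is only an illustration: it assumes a keyed DStream like the accountIdLines built further down, and the 60-second window and 2-second slide are placeholder durations.)

// Hypothetical sketch: count per accountId over a sliding 60-second window
// rather than a single 2-second batch. Durations are placeholders.
val windowedCounts = accountIdLines
  .map { case (accountId, _) => (accountId, 1L) }
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(2))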

Second: you do not need a SQL context to process the JSON. Just use any JSON library and group the RDD by accountId:

import play.api.libs.json._

val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
val dstreams = ssc.textFileStream("input")


dstreams.foreachRDD { rdd =>
  val jsonRdd = rdd.map(l => Json.parse(l))
  val grouped = jsonRdd.groupBy(json => (json \ "accountId").as[String])
}
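As a possible continuation of that snippet (only a sketch: the threshold and output path are placeholders, and the group sizes reflect just the current 2-second batch, not everything seen so far):

// Hypothetical continuation inside the foreachRDD block above:
// keep only groups larger than the threshold and write them out.
val bigAccounts = grouped.filter { case (_, jsons) => jsons.size > 100000 }
bigAccounts
  .flatMap { case (_, jsons) => jsons.map(Json.stringify) }
  .saveAsTextFile("output") // placeholder path; in practice it would need to be unique per batch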

If you want to use updateStateByKey, then stick with DStreams:

import play.api.libs.json._

val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)


val lines: DStream[String] = ssc.textFileStream("inputPath")

val jsonLines = lines.map[JsValue](l => Json.parse(l))

val accountIdLines = jsonLines.map[(String, JsValue)](json => {
  val accountId = (json \ "accountId").as[String]
  (accountId, json)
})

val accountIdCounts = accountIdLines
  .map[(String, Long)]({ case (accountId, json) => {
    (accountId, 1)
  } })
  .reduceByKey((a, b) => a + b)


// this DStream[(String, Long)] will have current accumulated count for accountId's
val updatedAccountCounts = accountIdCounts
  .updateStateByKey(updatedCountOfAccounts _)

def updatedCountOfAccounts(a: Seq[Long], b: Option[Long]): Option[Long] = {
  b.map(i => i + a.sum).orElse(Some(a.sum))
}
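One possible way to finish this sketch and tie it back to the original requirement (the threshold, the checkpoint directory, and the use of print() are illustrative placeholders, not part of the original answer):

// Hypothetical continuation: keep only accounts whose accumulated count
// exceeds the threshold, and register an output operation so the job runs.
val largeAccounts = updatedAccountCounts.filter { case (_, count) => count > 100000 }
largeAccounts.print() // or write each RDD out inside foreachRDD

ssc.checkpoint("checkpointDir") // updateStateByKey requires a checkpoint directory; path is a placeholder
ssc.start()
ssc.awaitTermination()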