How do I merge the data from each Spark input stream into a single dataset before applying transformations? I am using spark-2.0.0.
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // needed for the $"accountId" column syntax

val lines = ssc.textFileStream("input")
lines.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    val dataSet = sqlContext.read.json(rdd)
    val accountIds = dataSet.select("accountId").distinct.collect.flatMap(_.toSeq)
    // one DataFrame per distinct accountId in this batch
    val accountIdArry = accountIds.map(accountId => dataSet.where($"accountId" <=> accountId))
    accountIdArry.foreach { arrEle =>
      print(arrEle.count)
      arrEle.show
      arrEle.write.format("json").save("output")
    }
  }
}
I want to write to the output file only those records whose per-accountId count, taken across all input streams, exceeds 100000. For that I want to merge all the DStreams into one before applying the transformations.
Right now it writes every record to the output file. Any help?
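What I had in mind is something like the following sketch, where the streams are unioned into a single DStream before any transformation (dir1/dir2/dir3 are hypothetical placeholders for the separate inputs):

// Sketch: dir1/dir2/dir3 are hypothetical placeholders for the separate input streams.
val stream1 = ssc.textFileStream("dir1")
val stream2 = ssc.textFileStream("dir2")
val stream3 = ssc.textFileStream("dir3")

// union the streams so every transformation sees one combined DStream per batch
val merged = ssc.union(Seq(stream1, stream2, stream3))

merged.foreachRDD { rdd =>
  // rdd now holds the combined records from all inputs for this micro-batch
}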
Update
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$3.apply(PairDStreamFunctions.scala:433)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$3.apply(PairDStreamFunctions.scala:432)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:682)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions.updateStateByKey(PairDStreamFunctions.scala:432)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$1.apply(PairDStreamFunctions.scala:400)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$updateStateByKey$1.apply(PairDStreamFunctions.scala:400)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:682)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions.updateStateByKey(PairDStreamFunctions.scala:399)
    at SparkExample$.main(:60)
    ... 56 more
Caused by: java.io.NotSerializableException: SparkExample$
Serialization stack:
    - object not serializable (class: SparkExample$, value: SparkExample$@ab3b54)
    - field (class: SparkExample$$anonfun$5, name: $outer, type: class SparkExample$)
    - object (class SparkExample$$anonfun$5, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 74 more
SparkExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import play.api.libs.json._
import org.apache.spark.sql._
import org.apache.spark.streaming.dstream._

object SparkExample {
  def main(inputDir: String) {
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)
    val lines: DStream[String] = ssc.textFileStream(inputDir)
    val jsonLines = lines.map[JsValue](l => Json.parse(l))

    val accountIdLines = jsonLines.map[(String, JsValue)](json => {
      val accountId = (json \ "accountId").as[String]
      (accountId, json)
    })

    val accountIdCounts = accountIdLines
      .map[(String, Long)]({ case (accountId, json) => {
        (accountId, 1)
      } })
      .reduceByKey((a, b) => a + b)

    // this DStream[(String, Long)] will have the current accumulated count per accountId
    val updatedAccountCounts = accountIdCounts
      .updateStateByKey(updatedCountOfAccounts _)   // the closure built here captures the enclosing SparkExample$ object (see $outer in the stack trace)
  }

  def updatedCountOfAccounts(a: Seq[Long], b: Option[Long]): Option[Long] = {
    b.map(i => i + a.sum).orElse(Some(a.sum))
  }
}
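The serialization stack shows the anonymous function's $outer field pointing at SparkExample$. A minimal sketch of one commonly suggested workaround, assuming that capture is the cause: pass a self-contained function value instead of the method reference, so nothing from SparkExample$ ends up in the closure (another option often suggested is making the object extend Serializable).

// Assumed workaround sketch: a self-contained function value,
// so the closure handed to updateStateByKey carries no reference to SparkExample$.
val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
  (newCounts, previous) => previous.map(_ + newCounts.sum).orElse(Some(newCounts.sum))

// replaces the updateStateByKey(updatedCountOfAccounts _) line above
val updatedAccountCounts = accountIdCounts.updateStateByKey(updateFunc)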
Answer 0 (score: 1)
You need to keep two things in mind.

First, since you are using a StreamingContext with 2-second micro-batches, each RDD in the DStream contains only the data generated during those 2 seconds, not all the data seen so far. If you need to operate on all the data available at that point, then streaming is not the right fit for your problem.

Second, you don't need a SQL context to process the JSON. Just use any JSON library and group the RDD by accountId.
import play.api.libs.json._

val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)   // not actually needed for the JSON approach below
val dstreams = ssc.textFileStream("input")

dstreams.foreachRDD { rdd =>
  val jsonRdd = rdd.map(l => Json.parse(l))
  // RDD[(String, Iterable[JsValue])], keyed by accountId
  val grouped = jsonRdd.groupBy(json => (json \ "accountId").as[String])
}
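To connect this back to the 100000-records requirement from the question, a rough sketch of filtering the grouped records and writing only the large groups (the threshold comes from the question, the output path is illustrative, and note this only counts records within a single micro-batch, per the first point above):

dstreams.foreachRDD { rdd =>
  val jsonRdd = rdd.map(l => Json.parse(l))
  val grouped = jsonRdd.groupBy(json => (json \ "accountId").as[String])

  // keep only accountIds whose record count in this batch exceeds the threshold
  val large = grouped.filter { case (_, records) => records.size > 100000 }

  // write the surviving records back out as JSON text;
  // a unique path per batch avoids saveAsTextFile failing on an existing directory
  large.flatMap { case (_, records) => records.map(Json.stringify) }
    .saveAsTextFile(s"output/${System.currentTimeMillis}")
}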
If you want to use updateStateByKey, then stick with DStreams:
import play.api.libs.json._

val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
val lines: DStream[String] = ssc.textFileStream("inputPath")
val jsonLines = lines.map[JsValue](l => Json.parse(l))

val accountIdLines = jsonLines.map[(String, JsValue)](json => {
  val accountId = (json \ "accountId").as[String]
  (accountId, json)
})

val accountIdCounts = accountIdLines
  .map[(String, Long)]({ case (accountId, json) => {
    (accountId, 1)
  } })
  .reduceByKey((a, b) => a + b)

// this DStream[(String, Long)] will have the current accumulated count per accountId
val updatedAccountCounts = accountIdCounts
  .updateStateByKey(updatedCountOfAccounts _)

def updatedCountOfAccounts(a: Seq[Long], b: Option[Long]): Option[Long] = {
  b.map(i => i + a.sum).orElse(Some(a.sum))
}
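Keep in mind that updateStateByKey also requires a checkpoint directory, and the accumulated counts still need to be filtered against the question's 100000 threshold before writing anything out. A rough sketch, with illustrative paths:

// Sketch only: paths are illustrative, the threshold comes from the question.
ssc.checkpoint("checkpointDir")   // required when using updateStateByKey

// keep only accountIds whose accumulated count has crossed the threshold
val largeAccounts = updatedAccountCounts.filter { case (_, count) => count > 100000 }

largeAccounts.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // unique path per batch so saveAsTextFile does not collide with earlier output
    rdd.map { case (accountId, count) => s"$accountId,$count" }
      .saveAsTextFile(s"output/${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()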