I have a tuple:
val key = List(protocol, source, destination, port)
for each RDD.
I need to map it to
(protocol ,List(source, destination, port))
and then it should be reduced to a list
List(source, (destination1, destination2))
grouped by protocol.
Finally, it should look like a tuple:
(protocol, (source, (destination1, destination2)))
The output I need is as follows:
{(tcp , (xx.xx.xx.xx ,(ww.ww.w.w,rr.rr.r.r))) , (udp,(yy.yy.yy.yy,(ww.ww.w.w,rr.rr.r.r)))}
The code is:
val lines = KafkaUtils.createStream[String, PcapPacket, StringDecoder, PcapDecoder](
  ssc, kafkaParams, Map(topics -> 1), StorageLevel.MEMORY_ONLY)
val m = lines.window(Seconds(4), Seconds(4)).mapPartitions(x =>
  x.map(y => analysis(y._2))
)
This gives 5 fields as output.
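Since lines is a DStream rather than a plain RDD, the per-RDD grouping shown in the answer below can be applied to each micro-batch with transform. This is a minimal sketch, not tested code: it assumes the stream has already been mapped down to (protocol, source, destination, port) tuples, since the exact shape of analysis' five-field output is not shown in the question.

// Apply the pairwise grouping to every micro-batch of the windowed stream.
// Assumes m is a DStream[(String, String, String, Int)].
val grouped = m.transform { rdd =>
  rdd.map { case (protocol, source, destination, port) =>
    ((protocol, source), destination)            // key on (protocol, source)
  }.groupByKey().map { case ((protocol, source), destinations) =>
    (protocol, (source, destinations.toSeq))     // re-key on protocol alone
  }.groupByKey()
}
grouped.print()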
Answer 0 (score: 0):
You can group by (protocol, source) first, and then group by protocol. For example:
// On older Spark versions this import provides the implicit
// conversion to PairRDDFunctions (groupByKey etc.).
import org.apache.spark.SparkContext._

val testData = List(
  ("tcp", "xx.xx.xx.xx", "ww.ww.w.w", 12345),
  ("tcp", "xx.xx.xx.xx", "rr.rr.r.r", 12345),
  ("udp", "yy.yy.yy.yy", "ww.ww.w.w", 12345),
  ("udp", "yy.yy.yy.yy", "rr.rr.r.r", 12345)
)
val rdd = sc.parallelize(testData)

val resultRDD = rdd.map {
  // Key each record on (protocol, source); keep only the destination.
  case (protocol, source, destination, port) => ((protocol, source), destination)
}.groupByKey().map {
  // Collect the destinations per (protocol, source), then re-key on protocol.
  case ((protocol, source), destinations) => (protocol, (source, destinations.toSeq))
}.groupByKey()

resultRDD.collect().foreach(println)
However, you need to make sure that the data for each protocol fits in memory, since groupByKey materializes all the values for a key at once.
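If that becomes a problem, one common variant (not part of the original answer) is to build the destination lists with aggregateByKey, which combines values map-side before the shuffle. A sketch against the same rdd as above:

// Map-side combining variant: same result shape, less shuffle traffic.
val altRDD = rdd.map {
  case (protocol, source, destination, port) => ((protocol, source), destination)
}.aggregateByKey(List.empty[String])(
  (acc, d) => d :: acc,   // fold a destination into a partition-local list
  (a, b) => a ++ b        // merge lists coming from different partitions
).map {
  case ((protocol, source), destinations) => (protocol, (source, destinations))
}.groupByKey()

The final groupByKey on protocol still gathers every (source, destinations) pair for one protocol onto a single reducer, so the memory caveat applies to that last step either way.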