How to process a Tuple in Spark Streaming?

Time: 2018-09-12 02:59:30

Tags: scala apache-spark apache-kafka spark-streaming

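I have a problem with Spark in Scala: I want to multiply the elements of a tuple in Spark Streaming. I read data from Kafka into a DStream, and the data in my RDD looks like this: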

(2,[2,3,4,6,5])
(4,[2,3,4,6,5])
(7,[2,3,4,6,5])
(9,[2,3,4,6,5])

I want to multiply every element of the list by the first element of the tuple, like this:

 (2,[2*2,3*2,4*2,6*2,5*2])
 (4,[2*4,3*4,4*4,6*4,5*4])
 (7,[2*7,3*7,4*7,6*7,5*7])
 (9,[2*9,3*9,4*9,6*9,5*9])

That gives me an RDD like this:

 (2,[4,6,8,12,10])
 (4,[8,12,16,24,20])
 (7,[14,21,28,42,35])
 (9,[18,27,36,54,45])

Finally, I want a tuple whose second element is the minimum of that list, like this:

 (2,4)
 (4,8)
 (7,14)
 (9,18)

How can I do this in Scala on a DStream? I am using Spark 1.6.

2 Answers:

Answer 0: (score: 1)

Here is a demonstration in Scala:

// Local test setup; in streaming, apply the same steps to each RDD of the DStream.
// val conf = new SparkConf().setAppName("ttt").setMaster("local")
// val sc = new SparkContext(conf)
// val data = Array("2,2,3,4,6,5", "4,2,3,4,6,5", "7,2,3,4,6,5", "9,2,3,4,6,5")
// val lines = sc.parallelize(data)
lines
  .map { line =>
    val nums = line.split(",").map(_.toInt)  // parse each line only once
    (nums.head, nums.tail.toList)            // (key, values)
  }
  .map { case (key, values) => (key, values.min) }  // smallest value per key
  .map { case (key, min) => (key, min * key) }      // equals the smallest product when key >= 0
  .foreach(println)

This is the result:

(2,4)
(4,8)
(7,14)
(9,18)

Each RDD in a DStream contains the data of one batch interval, and you can manipulate each RDD as needed.
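The per-record step can also be written as a small pure function and checked on plain values before it is wired into a stream. The sketch below is only an illustration: `TupleOps` and `scaleAndMin` are hypothetical names, not part of Spark or of the answer above. It multiplies first and takes the minimum afterwards, as the question describes, so it also gives the expected result for a negative key, where taking the minimum before multiplying would not.

```scala
// Hypothetical helper (names are assumptions, not from Spark or the answer).
object TupleOps {
  // Multiply every value by the key, then keep the smallest product.
  def scaleAndMin(record: (Int, Seq[Int])): (Int, Int) = {
    val (key, values) = record
    require(values.nonEmpty, "values must not be empty") // min is undefined on an empty list
    (key, values.map(_ * key).min)
  }
}

object TupleOpsDemo {
  def main(args: Array[String]): Unit = {
    // Same sample data as in the question.
    val sample = Seq(2, 4, 7, 9).map(k => (k, Seq(2, 3, 4, 6, 5)))
    sample.map(TupleOps.scaleAndMin).foreach(println)
  }
}
```

Assuming `stream` is a `DStream[(Int, Seq[Int])]`, a function of this shape could be passed to `stream.map(TupleOps.scaleAndMin)`, or to `rdd.map` inside `foreachRDD`.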

Answer 1: (score: 0)

Let's say you have the RDD of tuples in a variable named input:

import scala.collection.mutable.ListBuffer

val result = input
  .map(x => {                          // for each (key, values) tuple
    val l = new ListBuffer[Int]()      // buffer for the multiplication results
    for (i <- x._2) {                  // for each element in the array
      l += x._1 * i                    // append the product to the buffer
    }
    (x._1, l.toList)                   // return the new tuple (Scala tuples start at _1)
  })
  .map(x => {
    (x._1, x._2.min)                   // keep only the smallest product
  })

result.foreach(println) should print:

(2,4)
(4,8)
(7,14)
(9,18)