I don't know if it's possible, but inside the mapPartitions of my RDD "a" I'd like to split the elements into two lists: one list, say l, storing all the numbers, and another list, say b, storing all the words. Something like a.mapPartitions((p,v) => { val l = p.toList; val b = v.toList; ... }
For example, in my for loop, l(i) = 1 and b(i) = "score".
import scala.collection.mutable.ListBuffer

val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

a.mapPartitions(p => {
  val l = p.toList                    // all (word, number) pairs in this partition
  val ret = new ListBuffer[Int]       // the numbers
  val words = new ListBuffer[String]  // the words
  for (i <- 0 until l.length) {
    words += l(i)._1
    ret += l(i)._2
  }
  ret.toList.iterator
})
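In other words, what I'm after is something like the sketch below (just an illustration; I'm assuming the standard List.unzip would do the split, and the names words/numbers are mine):

a.mapPartitions { it =>
  // split the partition's (word, number) pairs into two lists
  val (words, numbers) = it.toList.unzip
  // ... work with words and numbers here ...
  numbers.iterator
}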
Answer 0 (score: 1)
Spark is a distributed computing engine: you perform operations on the partitioned data across the nodes of the cluster, and you then need a reduce() step that performs the summary operation.
See this code, which does what you want:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {

  // Serializable container for the two lists we want to build.
  class MyResponseObj(var numbers: List[Int] = List[Int](),
                      var words: List[String] = List[String]()) extends java.io.Serializable {

    // Append a single (word, number) pair.
    def +=(str: String, int: Int) = {
      numbers = numbers :+ int
      words = words :+ str
      this
    }

    // Merge another container into this one (used by reduce).
    def +=(other: MyResponseObj) = {
      numbers = numbers ++ other.numbers
      words = words ++ other.words
      this
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val a = sc.parallelize(List(("score", 1), ("chicken", 2), ("magnacarta", 2)))

    // Build one MyResponseObj per partition, then merge them all with reduce.
    val myResponseObj = a.mapPartitions[MyResponseObj] { it =>
      val acc = new MyResponseObj()
      it.foreach {
        case (str: String, int: Int) => acc += (str, int)
        case _                       => println("unexpected data")
      }
      Iterator(acc)
    }.reduce((obj1, obj2) => obj1 += obj2)

    println(myResponseObj.words)
    println(myResponseObj.numbers)

    sc.stop()
  }
}
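Note that if the goal is simply to end up with the two lists on the driver, collect() plus the standard unzip is already enough; no custom accumulator is needed (a minimal sketch, not from the answer above):

// collect() materializes the pairs on the driver; unzip splits them
val (words, numbers) = a.collect().toList.unzip
println(words)   // List(score, chicken, magnacarta)
println(numbers) // List(1, 2, 2)

If the data needs to stay distributed, a.keys and a.values give the corresponding RDD[String] and RDD[Int] instead.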