Mapping in Spark Scala

Asked: 2014-11-12 00:41:55

Tags: scala apache-spark mapping

I am new to Spark, Scala, and this style of programming in general.

What I want to accomplish:

My RDD is of type org.apache.spark.rdd.RDD[(Double, Iterable[String])]

So its contents might look like:

<1 , (A,B,C)>
<42, (A)    >
<0 , (C,D)  >

I need to transform it into a new RDD so that I get output like this:

<1, A>
<1, B>
<1, C>
<42, A>
<0, C>
<0, D>

This must be very simple, but I have tried many different approaches and cannot get it to work.

2 Answers:

Answer 0 (score: 2)

You can use flatMapValues:

// In older Spark versions this import provides the implicit conversion
// to PairRDDFunctions, which is where flatMapValues is defined
import org.apache.spark.SparkContext._

val r: RDD[(Double, Iterable[String])] = ...
r.flatMapValues(x => x)
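For intuition, the same expansion can be sketched with plain Scala collections (no Spark needed); each (key, iterable) pair expands into one (key, element) pair per element. The sample data below mirrors the question's:

```scala
// Sample pairs mirroring the question's RDD contents
val pairs = Seq(
  (1.0, Seq("A", "B", "C")),
  (42.0, Seq("A")),
  (0.0, Seq("C", "D"))
)

// flatMapValues(x => x) on an RDD behaves like this flatMap on a Seq:
// keep the key, emit one output pair per element of the iterable
val flattened = pairs.flatMap { case (k, vs) => vs.map(v => (k, v)) }
// flattened: Seq((1.0,A), (1.0,B), (1.0,C), (42.0,A), (0.0,C), (0.0,D))
```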

Answer 1 (score: 0)

Take input of the form:

(Name, List[Interest])

"Chandru", ("Java","Scala","Python")
"Sriram",  ("Science","Maths","Hadoop","C2","c3")
"Jai",     ("Flink","Scala","Haskell")

Create a case class for a person:

 case class Person(name: String, interest: List[String])

Create the input and apply the transformation:

 val input = Seq(
   Person("Chandru", List("Java", "Scala", "Python")),
   Person("Sriram", List("Science", "Maths", "Hadoop", "C2", "c3")),
   Person("Jai", List("Flink", "Scala", "Haskell"))
 )

 val rdd = sc.parallelize(input)

 // Key each person by name, keeping the interest list as the value
 val mv = rdd.map(p => (p.name, p.interest))

 // Expand each (name, interests) pair into one (name, interest) pair per element;
 // the original's v.toStream is unnecessary, since a List is already traversable
 val fmv = mv.flatMapValues(v => v)

 fmv.collect

The result is:

  Array[(String, String)] = Array(
  (Chandru,Java), 
  (Chandru,Scala), 
  (Chandru,Python), 
  (Sriram,Science), 
  (Sriram,Maths), 
  (Sriram,Hadoop), 
  (Sriram,C2), 
  (Sriram,c3), 
  (Jai,Flink), 
  (Jai,Scala), 
  (Jai,Haskell))
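As a side note, the intermediate map step is not strictly necessary: the same pairs can be produced in one pass with a plain flatMap. A minimal sketch on ordinary Scala collections, reusing the Person case class from above:

```scala
case class Person(name: String, interest: List[String])

val people = Seq(
  Person("Chandru", List("Java", "Scala", "Python")),
  Person("Jai", List("Flink", "Scala", "Haskell"))
)

// Each person expands directly into (name, interest) pairs in one pass,
// equivalent to map(...).flatMapValues(...) on an RDD
val pairs = people.flatMap(p => p.interest.map(i => (p.name, i)))
// pairs: Seq((Chandru,Java), (Chandru,Scala), (Chandru,Python),
//            (Jai,Flink), (Jai,Scala), (Jai,Haskell))
```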