I'm new to Spark and Scala, and to this kind of programming in general.
What I'm trying to accomplish is this:
My RDD is of type org.apache.spark.rdd.RDD[(Double, Iterable[String])],
so its contents might look like:
<1 , (A,B,C)>
<42, (A) >
<0 , (C,D) >
I need to transform it into a new RDD so that I get output like this:
<1, A>
<1, B>
<1, C>
<42, A>
<0, C>
<0, D>
This must be very simple, but despite trying many different approaches I haven't been able to do it.
Answer 0 (score: 2)
You can use flatMapValues:
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions (pre-1.3 Spark)
import org.apache.spark.rdd.RDD

val r: RDD[(Double, Iterable[String])] = ...
r.flatMapValues(x => x)
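For intuition, here is the same flattening that flatMapValues performs, sketched on a plain Scala collection (no SparkContext needed); the sample pairs mirror the question's data:

```scala
// Pairs shaped like the question's RDD: (Double, Iterable[String]).
val pairs: Seq[(Double, Iterable[String])] = Seq(
  1.0  -> Seq("A", "B", "C"),
  42.0 -> Seq("A"),
  0.0  -> Seq("C", "D")
)

// For each (key, values) pair, emit one (key, value) pair per element --
// exactly what flatMapValues(x => x) does on a pair RDD.
val flattened: Seq[(Double, String)] =
  pairs.flatMap { case (k, vs) => vs.map(k -> _) }
// flattened: Seq((1.0,A), (1.0,B), (1.0,C), (42.0,A), (0.0,C), (0.0,D))
```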
Answer 1 (score: 0)
Suppose the input is of the form (Name, List[Interest]), for example:
"Chandru",("Java","Scala","Python")
"Sriram", ("Science","Maths","Hadoop","C2","c3")
"Jai",("Flink","Scala","Haskell")
Create a case class for a person:
case class Person(name:String, interest:List[String])
Create the input:
val input = Seq(
  Person("Chandru", List("Java", "Scala", "Python")),
  Person("Sriram", List("Science", "Maths", "Hadoop", "C2", "c3")),
  Person("Jai", List("Flink", "Scala", "Haskell"))
)
val rdd = sc.parallelize(input)
val mv = rdd.map(p => (p.name, p.interest))
val fmv = mv.flatMapValues(v => v)
fmv.collect
The result is:
Array[(String, String)] = Array(
(Chandru,Java),
(Chandru,Scala),
(Chandru,Python),
(Sriram,Science),
(Sriram,Maths),
(Sriram,Hadoop),
(Sriram,C2),
(Sriram,c3),
(Jai,Flink),
(Jai,Scala),
(Jai,Haskell))
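The same result can be produced without flatMapValues at all: flatMapValues on a pair RDD is just a specialized flatMap. A minimal sketch on a plain Scala collection (no Spark required), reusing the answer's Person case class and sample data:

```scala
case class Person(name: String, interest: List[String])

val people = Seq(
  Person("Chandru", List("Java", "Scala", "Python")),
  Person("Sriram", List("Science", "Maths", "Hadoop", "C2", "c3")),
  Person("Jai", List("Flink", "Scala", "Haskell"))
)

// One (name, interest) pair per interest; on an RDD this would be
// rdd.flatMap(p => p.interest.map(i => (p.name, i))).
val flattened: Seq[(String, String)] =
  people.flatMap(p => p.interest.map(i => (p.name, i)))
// 11 pairs, from (Chandru,Java) through (Jai,Haskell)
```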