Suppose I have an RDD of ("a", "b", "c")
and a key generator like this:
def keygen(x:String) = x match {
case "a" => Seq("x","y")
case "b" => Seq("x")
case "c" => Seq()
}
How can I get a key-value RDD of ("x" -> Seq("a", "b"), "y" -> Seq("a"))?
Here is my attempt:
val sample = sc.parallelize(Seq("a", "b", "c"))
def keygen(x: String) = x match {
case "a" => Seq("x", "y")
case "b" => Seq("x")
case "c" => Seq()
}
val sampleWithKey = sample.flatMap(x => keygen(x).map(y => (y, x))).groupBy(_._1).mapValues(_.map(_._2))
val result = sampleWithKey.collect()
// Concatenate with + so a single string is printed rather than an auto-tupled pair
println("result: " + result.mkString("(", ",", ")"))
This gives (x,List(a, b)),(y,List(a))
Answer 0 (score: 0)
Hmm... it looks a bit odd, but you can achieve this as follows:
def keygen(x:String) = x match {
case "a" => Seq("x","y")
case "b" => Seq("x")
case "c" => Seq("Empty")
}
val stringRdd = sc.parallelize( List( "a", "b", "c" ) )
// RDD[ "a", "b", "c" ]
val keyedRdd = stringRdd.map( string => ( keygen( string ), string ) )
// RDD[ ( Seq("x", "y"), "a" ), ( Seq("x"), "b" ), ( Seq("Empty"), "c" ) ]
val keyFlatRdd = keyedRdd
.flatMap( { case ( keySeq, string ) => keySeq.map( key => ( key, string ) ) } )
.filter( { case ( key, string ) => !key.equalsIgnoreCase( "Empty" ) } )
// RDD[ ("x", "a"), ("y", "a"), ("x", "b") ]
val finalRdd = keyFlatRdd
.groupBy( { case ( key, string ) => key } )
.map( { case ( key, seq ) => ( key, seq.map( _._2 ) ) } )
// RDD[ ( "x", Seq("a", "b") ), ( "y", Seq("a") ) ]
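The "Empty" placeholder and the filter step are probably not needed: flatMap over an empty Seq emits no pairs, so "c" drops out on its own. Below is a minimal sketch of that idea on a plain Scala collection, so it runs without a Spark cluster. The names KeygenDemo and invert are hypothetical, introduced only for illustration. On an RDD the same shape should carry over as flatMap followed by groupByKey, though that variant is not shown or tested here.

```scala
object KeygenDemo {
  // Same key generator as in the question; "c" (and anything else) maps to no keys.
  def keygen(x: String): Seq[String] = x match {
    case "a" => Seq("x", "y")
    case "b" => Seq("x")
    case _   => Seq.empty
  }

  // Hypothetical helper: emit one (key, item) pair per generated key,
  // then group the pairs by key and keep only the items.
  def invert(items: Seq[String]): Map[String, Seq[String]] =
    items
      .flatMap(item => keygen(item).map(key => (key, item))) // empty Seq => no pairs
      .groupBy(_._1)
      .map { case (key, pairs) => (key, pairs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val result = invert(Seq("a", "b", "c"))
    // Sort by key so the printed order is deterministic.
    println(result.toSeq.sortBy(_._1).mkString(", "))
  }
}
```

Because keygen("c") returns an empty Seq, no sentinel value has to be introduced and later filtered out, which also avoids accidentally discarding a real key that happens to be named "Empty".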