Spark: how to group by an arbitrary number of keys?

Date: 2015-04-23 05:19:16

Tags: scala apache-spark

If I have an RDD of ("a","b","c")

and the key generator is something like

def keygen(x:String) = x match {
  case "a" => Seq("x","y")
  case "b" => Seq("x")
  case "c" => Seq()
}

How do I get a key-value RDD of ("x" -> Seq("a","b"), "y" -> Seq("a"))?

My way of doing this:

val sample = sc.parallelize(Seq("a", "b", "c"))

def keygen(x: String) = x match {
  case "a" => Seq("x", "y")
  case "b" => Seq("x")
  case "c" => Seq()
}
val sampleWithKey = sample
  .flatMap(x => keygen(x).map(key => (key, x))) // pair each element with each of its keys
  .groupBy(_._1)
  .mapValues(_.map(_._2))
val result = sampleWithKey.collect()
println("result: " + result.mkString("(", ",", ")"))

which gives (x,List(a, b)),(y,List(a))
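
As a side note (a sketch, not from the original post): since the flatMap already produces key-value pairs, the groupBy(_._1).mapValues(_.map(_._2)) chain can be written with groupByKey, using the same sample and keygen defined above:

val grouped = sample
  .flatMap(x => keygen(x).map(key => (key, x))) // RDD[(String, String)], one pair per (key, element)
  .groupByKey()                                 // RDD[(String, Iterable[String])]
// grouped.collect() yields ("x" -> [a, b]) and ("y" -> [a])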

1 Answer:

Answer 0 (score: 0)

Hmm... it looks a bit odd, but you can achieve this as follows:

def keygen(x:String) = x match {
  case "a" => Seq("x","y")
  case "b" => Seq("x")
  case "c" => Seq("Empty")
}


val stringRdd = sc.parallelize( List( "a", "b", "c" ) )
// RDD[ "a", "b", "c" ]

val keyedRdd = stringRdd.map( string => ( keygen( string ), string ) )
// RDD[ ( Seq("x", "y"), "a" ), ( Seq("x"), "b" ), ( Seq("Empty"), "c" ) ]

val keyFlatRdd = keyedRdd
  .flatMap( { case ( keySeq, string ) =>  keySeq.map( key => ( key, string ) ) } )
  .filter( { case ( key, string ) => !key.equalsIgnoreCase( "Empty" ) } )
// RDD[ ("x", "a"), ("y", "a"), ("x", "b") ]

val finalRdd = keyFlatRdd
  .groupBy( { case ( key, string ) => key } )
  .map( { case ( key, seq ) => ( key, seq.map( _._2 ) ) } )
// RDD[ ( "x", Seq("a", "b") ), ( "y", Seq("a") ) ]