Spark: map an RDD's positions back to their keys

Date: 2019-12-09 12:38:01

Tags: scala apache-spark

I have an RDD in the following format:

[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]

(pos1 ,pos2, ............, posn)

Q: How can I map each position back to its key, to produce something like this?

("b", "e", "a", "d", "a" .....) 
// "b" corresponds to pos1, "e" corresponds to pos2, and so on

Example (edited):

// a chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)


// from rdd I can create my position rdd which is something like: 
val positions = Vector(1,2,3,4,.......125) // my positions

// I want to map each position to my tokens ("a", "b", "c", ...) to achieve:
Vector("a", "b", "a", ...)
// "a" corresponds to pos1, "b" corresponds to pos2, ...
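The inversion being asked for can be sketched with plain Scala collections first, independent of Spark. The sample data below is hypothetical (smaller than the question's), chosen only to make the ordering visible: each token is paired with the set of positions it occupies, the pairs are flipped to (position, token), sorted by position, and the tokens are read off in order.

```scala
// Hypothetical sample: each token maps to the set of positions it occupies.
val data = Vector("a" -> Set(1, 3), "b" -> Set(2), "c" -> Set(4))

// Invert to (position, token) pairs, sort by position, keep only the tokens.
val byPosition = data
  .flatMap { case (token, positions) => positions.map(p => (p, token)) }
  .sortBy(_._1)
  .map(_._2)

println(byPosition.mkString) // prints "abac"
```

The same flatMap/sort/map shape carries over to an RDD, as the answer below shows.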

1 answer:

Answer 0 (score: 2)

I'm not sure you need Spark for this particular use case (it starts from a Vector and ends with a Vector/String containing all the data's characters).

Still, here is a suggestion that should do what you need:

val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)

val result = rdd.flatMap { case (k, positions) => positions.map(p => Map(p -> k)) }
  .reduce(_ ++ _) // aggregate the Maps together: partitions are reduced first, then the executors' results are merged
  .toVector
  .sortBy(_._1)   // sort the (position, char) pairs by position
  .map(_._2)      // keep only the characters
  .mkString
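If the data really is large enough to warrant Spark, the `reduce(_ ++ _)` above pulls all the Maps onto the driver. A minimal sketch of an alternative that keeps the sort distributed, assuming `rdd` is the `RDD[(String, Set[Int])]` built above and a `SparkContext` is available (not runnable without a Spark runtime):

```scala
import org.apache.spark.rdd.RDD

// Sketch: emit (position, token) pairs and let Spark sort them by key,
// instead of merging per-record Maps on the driver.
def tokensByPosition(rdd: RDD[(String, Set[Int])]): String =
  rdd
    .flatMap { case (token, positions) => positions.map(p => (p, token)) }
    .sortByKey() // distributed sort by position
    .values      // drop the positions, keep the tokens
    .collect()   // only the final, already-ordered tokens reach the driver
    .mkString
```

`sortByKey`, `values`, and `collect` are standard RDD pair operations; the only driver-side work left is concatenating the collected tokens.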