How to convert a list into an RDD using Scala and Spark

Date: 2017-03-11 18:27:33

Tags: scala apache-spark sbt

I am trying to generate an RDD from a List and another RDD, using Scala and Spark. The idea is to take a list of values and generate an index containing all the entries of the original dataset that contain each value. Here is the code I am trying:

  def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
    // filter function
    def hasVal(foo: String)(bar: Int): Boolean =
      foo.toInt == bar
    // call to sc.parallelize to ensure that an RDD is returned
    sc parallelize(
      foos map (_ match {
        case (s: String) => (s, bars filter hasVal(s))
      })
    )
  }

Unfortunately, this does not compile in sbt:

> compile
[info] Compiling 1 Scala source to $TARGETDIR/target/scala-2.11/classes...
[error] $TARGETDIR/src/main/scala/wikipedia/WikipediaRanking.scala:56: type mismatch;
[error]  found   : List[(String, org.apache.spark.rdd.RDD[Int])]
[error]  required: Seq[(String, Iterable[Int])]
[error] Error occurred in an application involving default arguments.
[error]       foos map (_ match {
[error]            ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 1 s, completed Mar 11, 2017 7:11:31 PM

I don't really understand the error I am getting. List is a subclass of Seq, and I assumed that RDD would be a subclass of Iterable. Is there something obvious I am missing?
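
For what it's worth, the types seem to come out like this (a rough sketch; the name perFoo is only for illustration):

  // bars filter hasVal(s) is still an RDD[Int], so the mapped list is the
  // List[(String, RDD[Int])] that the compiler reports as "found":
  val perFoo: List[(String, RDD[Int])] =
    foos map (s => (s, bars filter hasVal(s)))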

2 answers:

Answer 0 (score: 3):

Here is the solution as I understand it (it should use less memory than the cartesian product):

  def mcveInvertIndex(foos: List[String],
                      bars: RDD[Int]): RDD[(String, Iterable[Int])] = 
  {

    // filter function
    def hasVal(foo: String, bar: Int): Boolean =
      foo.toInt == bar

    // Producing RDD[(String, Iterable[Int])]
    (for {
      bar <- bars // it's important to have RDD 
                  // at first position of for-comprehension
                  // to produce the correct result type
      foo <- foos
      if hasVal(foo, bar)
    } yield (foo, bar)).groupByKey()
  }
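
A quick way to exercise it (a minimal sketch: the sample values and the SparkContext sc are assumed, and the concrete Iterable implementation printed may differ):

  val foos = List("1", "2", "3")
  val bars = sc.parallelize(Seq(1, 1, 2, 4))

  // Prints something like (ordering may vary):
  // (1,CompactBuffer(1, 1))
  // (2,CompactBuffer(2))
  mcveInvertIndex(foos, bars).collect().foreach(println)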

Answer 1 (score: 1):

As mentioned in the comments, an RDD is not an Iterable, so you have to combine the two in some way and then aggregate them. This is my quick solution, although there is probably a more efficient method:

  def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
    // Pair every foo with every bar, key by the foo, and then aggregate:
    // the seq-op keeps only the bars whose value matches the foo, and the
    // comb-op concatenates the per-partition results.
    sc.makeRDD(foos)
      .cartesian(bars)                  // RDD[(String, Int)]
      .keyBy(x => x._1)                 // RDD[(String, (String, Int))]
      .aggregateByKey(Iterable.empty[Int])(
        (agg: Iterable[Int], currVal: (String, Int)) => {
          if (currVal._1.toInt != currVal._2) agg
          else currVal._2 +: agg.toList
        },
        _ ++ _
      )
  }
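
For comparison, the same check as above (again assuming sc and the sample values; note that with this version a foo with no matching bars still shows up, paired with an empty Iterable):

  val foos = List("1", "2", "3")
  val bars = sc.parallelize(Seq(1, 1, 2, 4))

  // Prints something like (ordering and the concrete collection type may vary):
  // (1,List(1, 1))
  // (2,List(2))
  // (3,List())
  mcveInvertIndex(foos, bars).collect().foreach(println)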