
Time: 2017-03-11 18:27:33

Tags: scala apache-spark sbt

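I am trying to generate an RDD from a List and another RDD, using Scala and Spark. The idea is to take a list of values and build an index that maps each value to all entries of the original dataset that contain it. Here is the code I am trying: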

  def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
    // filter function
    def hasVal(foo: String)(bar: Int): Boolean =
      foo.toInt == bar
    // call to sc.parallelize to ensure that an RDD is returned
    sc parallelize(
      foos map (_ match {
        case (s: String) => (s, bars filter hasVal(s))
      })
    )
  }

Compiling this with sbt fails:

> compile
[info] Compiling 1 Scala source to $TARGETDIR/target/scala-2.11/classes...
[error] $TARGETDIR/src/main/scala/wikipedia/WikipediaRanking.scala:56: type mismatch;
[error]  found   : List[(String, org.apache.spark.rdd.RDD[Int])]
[error]  required: Seq[(String, Iterable[Int])]
[error] Error occurred in an application involving default arguments.
[error]       foos map (_ match {
[error]            ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 1 s, completed Mar 11, 2017 7:11:31 PM

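I don't really understand the error I am getting. List is a subclass of Seq, and I assumed that RDD is a subclass of Iterable. Is there something obvious that I have missed?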

2 answers:

Answer 0 (score: 3)


  def mcveInvertIndex(foos: List[String],
                      bars: RDD[Int]): RDD[(String, Iterable[Int])] = {

    // filter function
    def hasVal(foo: String, bar: Int): Boolean =
      foo.toInt == bar

    // Producing RDD[(String, Iterable[Int])]
    (for {
      bar <- bars // the RDD must be the first generator of the
                  // for-comprehension so that the result is an RDD
      foo <- foos
      if hasVal(foo, bar)
    } yield (foo, bar)).groupByKey()
  }
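
For readers puzzled by the original error: the compiler output says that `List[(String, RDD[Int])]` was found where `Seq[(String, Iterable[Int])]` was required, and `RDD` does not extend `Iterable`, so the inner `bars filter ...` can never fit that type. The for-comprehension avoids this by starting from the RDD. A rough sketch of what it desugars to is shown below; the object name, the local master and the sample data are assumptions made for illustration only, not part of the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object InvertIndexSketch {
  // Desugared form of the for-comprehension: the outer generator is the RDD,
  // so the outer call is RDD.flatMap and the result stays an RDD.
  // The inner List operations run inside each task on the captured `foos`.
  def invertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] =
    bars
      .flatMap(bar => foos.filter(_.toInt == bar).map(foo => (foo, bar)))
      .groupByKey()

  def main(args: Array[String]): Unit = {
    // Hypothetical local context, only for trying the sketch out.
    val conf = new SparkConf().setMaster("local[2]").setAppName("invert-index-sketch")
    val sc = new SparkContext(conf)
    try {
      val bars = sc.parallelize(Seq(1, 2, 2, 3))
      val index = invertIndex(List("1", "2", "4"), bars)
        .mapValues(_.toList)
        .collect()
        .toMap
      // "4" matches nothing in bars, so it does not appear as a key.
      println(index)
    } finally sc.stop()
  }
}
```

Note that values from `foos` without any match are simply absent from the result, because `groupByKey` only sees keys that were emitted at least once.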

Answer 1 (score: 1)


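def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // NOTE: only the two lambdas survive in the source; the cartesian/keyBy/aggregateByKey
  // call around them is a reconstruction inferred from their signatures (an assumption).
  sc.parallelize(foos).cartesian(bars).keyBy(_._1).aggregateByKey(Iterable.empty[Int])(
    (agg: Iterable[Int], currVal: (String, Int)) => {
      if (currVal._1.toInt != currVal._2) agg
      else currVal._2 +: agg.toList
    },
    _ ++ _
  )
}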
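
To make the roles of the two functions in this answer easier to see, here is a small Spark-free sketch that applies them to plain Scala collections. It assumes they are meant as the per-record and merge steps of an aggregation such as `aggregateByKey`; the two "partitions" and the object name are hypothetical and only stand in for how Spark would split the data.

```scala
object AggregateSketch {
  // Per-record step, as in the answer: keep the value only when it matches the key.
  val seqOp: (Iterable[Int], (String, Int)) => Iterable[Int] =
    (agg, currVal) =>
      if (currVal._1.toInt != currVal._2) agg
      else currVal._2 +: agg.toList

  // Merge step: concatenates partial results computed on different partitions.
  val combOp: (Iterable[Int], Iterable[Int]) => Iterable[Int] = _ ++ _

  def main(args: Array[String]): Unit = {
    // Two hypothetical partitions holding (foo, bar) pairs for the key "2".
    val partition1 = Seq(("2", 1), ("2", 2))
    val partition2 = Seq(("2", 2), ("2", 3))

    // Each partition is folded independently, starting from an empty result.
    val partial1 = partition1.foldLeft(Iterable.empty[Int])(seqOp)
    val partial2 = partition2.foldLeft(Iterable.empty[Int])(seqOp)

    println(combOp(partial1, partial2).toList) // List(2, 2)
  }
}
```

Unlike a filter followed by `groupByKey`, an aggregation of this shape keeps every key it sees, so a value with no matching entries would end up with an empty collection rather than being dropped.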