Different behavior when using the Spark REPL vs. a standalone Spark program

Date: 2014-06-06 16:21:48

Tags: scala, apache-spark

When I run this code through the Spark REPL:

  val sc = new SparkContext("local[4]" , "")

  val x = sc.parallelize(List( ("a" , "b" , 1) , ("a" , "b" , 1) , ("c" , "b" , 1) , ("a" , "d" , 1)))

  val byKey = x.map({case (sessionId,uri,count) => (sessionId,uri)->count})
  val reducedByKey = byKey.reduceByKey(_ + _ , 2)

  val grouped = byKey.groupByKey
  val count = grouped.map{case ((sessionId,uri),count) => ((sessionId),(uri,count.sum))}
  val grouped2 = count.groupByKey

the REPL shows the type of grouped2 as:

grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] 

However, if I use the same code in a standalone Spark program, a different type is returned for grouped2, as shown by this error:

type mismatch;
  found   : org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])]
  required: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])]
  Note: (String, Iterable[(String, Int)]) >: (String, Seq[(String, Int)]), but class RDD is invariant in type T.
    You may wish to define T as -T instead. (SLS 4.5)
  val grouped2 :  org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey

Here is the full code for the standalone version:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._

object Tester extends App {

  val sc = new SparkContext("local[4]" , "")

  val x = sc.parallelize(List( ("a" , "b" , 1) , ("a" , "b" , 1) , ("c" , "b" , 1) , ("a" , "d" , 1)))

  val byKey = x.map({case (sessionId,uri,count) => (sessionId,uri)->count})
  val reducedByKey = byKey.reduceByKey(_ + _ , 2)

  val grouped = byKey.groupByKey
  val count = grouped.map{case ((sessionId,uri),count) => ((sessionId),(uri,count.sum))}
  val grouped2 : org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey

}

Why is the type returned in the REPL different from the one in the standalone program?

Update: in the standalone program grouped2 is inferred as RDD[(String, Iterable[Nothing])], so val grouped2: RDD[(String, Iterable[Nothing])] = count.groupByKey compiles.

So there are three possible return types depending on how the program is run?

Update 2: IntelliJ appears to infer the type incorrectly:

val x : org.apache.spark.rdd.RDD[(String, (String, Int))] = sc.parallelize(List( ("a" , ("b" , 1)) , ("a" , ("b" , 1))))

val grouped = x.groupByKey()

IntelliJ infers grouped as org.apache.spark.rdd.RDD[(String, Iterable[Nothing])]

It should be org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] (which is what the Spark REPL, version 1.0, infers).

1 answer:

Answer 0 (score: 1)

For completeness: the Spark API changed here between 0.9 and 1.0. groupByKey now returns a pair whose second member is an Iterable rather than a Seq.
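
For reference, here is a minimal sketch of how the standalone code from the question can be made to compile against the 1.0 API (assuming the same imports as in the question, in particular org.apache.spark.SparkContext._ for the implicit pair-RDD conversions; grouped2AsSeq is an illustrative name): either accept the Iterable that groupByKey now returns, or convert it back to a Seq explicitly if downstream code still expects one.

  // Option 1: declare the Iterable element type that groupByKey returns in 1.0.
  val grouped2: RDD[(String, Iterable[(String, Int)])] = count.groupByKey

  // Option 2: convert back to a Seq explicitly if later code still needs Seq.
  val grouped2AsSeq: RDD[(String, Seq[(String, Int)])] =
    count.groupByKey.mapValues(_.toSeq)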

Regarding the IntelliJ issue: unfortunately, it is not very hard to confuse IntelliJ's type inference. If it comes up with Nothing, it is most likely wrong.
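
One way to double-check which type is actually in play, independent of what IntelliJ displays, is to add explicit annotations and let scalac (via an sbt build) decide. The snippet below is only a sketch assuming the Spark 1.0 API and the sc from the question; pairs and groupedPairs are illustrative names. If this compiles on the command line, the Iterable[Nothing] shown in the IDE can be ignored.

  // Explicit annotations so the compiler, not the IDE, verifies the types.
  // Under Spark 1.0, groupByKey yields Iterable values, so this compiles.
  val pairs: RDD[(String, (String, Int))] =
    sc.parallelize(List(("a", ("b", 1)), ("a", ("b", 1))))
  val groupedPairs: RDD[(String, Iterable[(String, Int)])] = pairs.groupByKey()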