Spark: filter RDD elements by class

Time: 2017-06-10 13:33:48

Tags: scala apache-spark

I have an RDD with elements of different types, and I want to count the elements by type. For example, filtering on built-in types such as Int and String works as expected:

scala> val rdd = sc.parallelize(List(1, 2.0, "abc"))
rdd: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.filter{case z:Int => true; case _ => false}.count
res0: Long = 1

scala> rdd.filter{case z:String => true; case _ => false}.count
res1: Long = 1
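
However, if the elements are of a user-defined type, the same kind of filter does not work as expected in the spark-shell. A minimal sketch of the failing case (the class names are illustrative and mirror those in the answer below):

scala> class TypeA extends Serializable
scala> case class TypeB(id: Long) extends TypeA
scala> case class TypeC(name: String) extends TypeA
scala> val rdd2 = sc.parallelize(List(TypeB(1), TypeC("x"), TypeB(2)))
scala> rdd2.filter { case z: TypeB => true; case _ => false }.count
// expected 2, but with classes defined in the REPL the match can succeed for every element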

Am I missing something here? Help!

1 Answer:

Answer 0 (score: 3)

This looks like a variant of SPARK-1199 and is likely a REPL bug.

Running this locally inside IDEA produces the expected behavior:

import org.apache.spark.SparkContext

// the same kind of user-defined hierarchy as in the question
class TypeA extends Serializable
case class TypeB(id: Long) extends TypeA
case class TypeC(name: String) extends TypeA

val sc = new SparkContext("local[*]", "swe")
val rdd = sc.parallelize(List(TypeB(12), TypeC("Hsa")))

// keep only the TypeB elements; there is exactly one in this data set
rdd.filter { case x: TypeB => true; case _ => false }.count()

Yields:

import org.apache.spark.SparkContext

defined class TypeA
defined class TypeB
defined class TypeC   

sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@10a1410d
rdd: org.apache.spark.rdd.RDD[TypeA with Product] = ParallelCollectionRDD[0] at parallelize at <console>:18

[Stage 0:>....... (0 + 0) / 4]
res0: Long = 1
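
As an aside on the original goal of counting elements by type: rather than running one filter per type, a single pass can group elements by their runtime class. This is an illustrative sketch, not part of the original answer:

// one Spark job; yields e.g. Map("TypeB" -> 1, "TypeC" -> 1) for the data above
rdd.map(_.getClass.getSimpleName).countByValue()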