Error in overloaded Spark RDD function zipPartitions

Time: 2014-05-14 00:54:31

Tags: apache-spark

I am trying to use the zipPartitions function defined in Spark's RDD class (Spark Scala docs URL here: http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD).

The function is overloaded and has several alternatives:

def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

I have defined a function merge with the type signature:

merge(iter1: Iterator[(Int,Int)], iter2: Iterator[(Int,Int)]): Iterator[(Int,Int)]

and I have two RDDs of type RDD[(Int, Int)].
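
To make this concrete, here is a minimal sketch of the setup in the spark-shell; the body of merge and the sample data below are placeholders for illustration, not my actual code:

import org.apache.spark.rdd.RDD

// Placeholder implementation: just concatenate the two partition iterators.
def merge(iter1: Iterator[(Int, Int)], iter2: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
  iter1 ++ iter2

// Placeholder data; `sc` is the SparkContext that the spark-shell provides.
val Rdd1: RDD[(Int, Int)] = sc.parallelize(Seq((1, 10), (2, 20)))
val Rdd2: RDD[(Int, Int)] = sc.parallelize(Seq((1, 100), (2, 200)))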

However, when I execute Rdd1.zipPartitions(Rdd2, merge), the spark shell throws an error saying:

error: missing arguments for method merge;
follow this method with `_' if you want to treat it as a partially applied function

This is strange, because elsewhere I have been able to pass a function as an argument to another method. However, if I add the two underscores to merge and try

Rdd1.zipPartitions(Rdd2, merge(_: Iterator[(Int,Int)], _: Iterator[(Int,Int)])), I then get a different error:

error: overloaded method value zipPartitions with alternatives:
  [B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$34: scala.reflect.ClassTag[B], implicit evidence$35: scala.reflect.ClassTag[C], implicit evidence$36: scala.reflect.ClassTag[D], implicit evidence$37: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
  [B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$30: scala.reflect.ClassTag[B], implicit evidence$31: scala.reflect.ClassTag[C], implicit evidence$32: scala.reflect.ClassTag[D], implicit evidence$33: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
  [B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$27: scala.reflect.ClassTag[B], implicit evidence$28: scala.reflect.ClassTag[C], implicit evidence$29: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
  [B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$24: scala.reflect.ClassTag[B], implicit evidence$25: scala.reflect.ClassTag[C], implicit evidence$26: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
  [B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
  [B, V](rdd2: org.apache.spark.rdd.RDD[B], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$20: scala.reflect.ClassTag[B], implicit evidence$21: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
 cannot be applied to (org.apache.spark.rdd.RDD[(Int, Int)], (Iterator[(Int, Int)], Iterator[(Int, Int)]) => Iterator[(Int, Int)])

       val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))

I suspect that the error is in that last line.

The function definition that I am trying to match this call to is:

[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
however, what Scala sees is

val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))

where the [B] type parameter has been converted to [(Int, Int)].

Any insight into how to get this to work would be much appreciated!

1 Answer:

Answer 0 (score: 2):

If you look at the signatures, you will see that this is actually a method with multiple argument lists, not a single argument list with multiple arguments. The call you need is more like:

RDD1.zipPartitions(RDD1)(merge)
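
In full, with the merge function and the two RDDs from the question, that would look roughly like this (a sketch under the same assumptions as above, not your exact code):

// Note: zipPartitions requires both RDDs to have the same number of partitions.
// The other RDD goes in the first argument list, the function in the second.
val combined = Rdd1.zipPartitions(Rdd2)(merge)

// Equivalent, with an explicit function literal instead of the method name:
val combined2 = Rdd1.zipPartitions(Rdd2)((a, b) => merge(a, b))

combined.collect()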

(I am not sure about the type ascription you added in your original call?)

You may need some other adjustments to get this working, but that is the key to fixing the error you are currently seeing.
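
For what it's worth, the first error ("missing arguments for method merge") is the usual Scala behaviour when a bare method name appears where no function type is expected: none of the zipPartitions overloads takes the function in the same argument list as the RDD (the two-argument first lists expect a Boolean there), so the compiler refuses to eta-expand merge. A tiny plain-Scala illustration, not Spark-specific:

def add(a: Int, b: Int): Int = a + b

// Curried: the second argument list expects a function, so the method name `add`
// is eta-expanded into a function value automatically.
def applyCurried(x: Int)(f: (Int, Int) => Int): Int = f(x, x)
applyCurried(3)(add)        // compiles

// Without an expected function type, the bare name triggers the
// "follow this method with `_`" error; explicit eta-expansion always works:
val addFn = add _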