I'm trying to use the zipPartitions function defined in Spark's RDD class (Scala API docs here: http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD).
The function is overloaded, with several variants:
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
I've defined a merge function with the type signature:
merge(iter1: Iterator[(Int,Int)], iter2: Iterator[(Int,Int)]): Iterator[(Int,Int)]
and I have two RDDs of type RDD[(Int, Int)].
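(For reference, a simplified stand-in for merge would be the sketch below; the body is purely illustrative, since only the signature matters for this question:

def merge(iter1: Iterator[(Int, Int)], iter2: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
  iter1 ++ iter2  // placeholder body: just concatenates the two partition iterators
)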
However, when I call Rdd1.zipPartitions(Rdd2, merge), the Spark shell throws this error:
error: missing arguments for method merge;
follow this method with `_' if you want to treat it as a partially applied function
This is strange, because elsewhere I've been able to pass functions as arguments to other methods. However, if I add two underscores and try
Rdd1.zipPartitions(Rdd2, merge(_: Iterator[(Int,Int)], _: Iterator[(Int,Int)]))
then I get a different error:
error: overloaded method value zipPartitions with alternatives:
[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$34: scala.reflect.ClassTag[B], implicit evidence$35: scala.reflect.ClassTag[C], implicit evidence$36: scala.reflect.ClassTag[D], implicit evidence$37: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], rdd4: org.apache.spark.rdd.RDD[D], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$30: scala.reflect.ClassTag[B], implicit evidence$31: scala.reflect.ClassTag[C], implicit evidence$32: scala.reflect.ClassTag[D], implicit evidence$33: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C])(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$27: scala.reflect.ClassTag[B], implicit evidence$28: scala.reflect.ClassTag[C], implicit evidence$29: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, C, V](rdd2: org.apache.spark.rdd.RDD[B], rdd3: org.apache.spark.rdd.RDD[C], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$24: scala.reflect.ClassTag[B], implicit evidence$25: scala.reflect.ClassTag[C], implicit evidence$26: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V] <and>
[B, V](rdd2: org.apache.spark.rdd.RDD[B], preservesPartitioning: Boolean)(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$20: scala.reflect.ClassTag[B], implicit evidence$21: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
cannot be applied to (org.apache.spark.rdd.RDD[(Int, Int)], (Iterator[(Int, Int)], Iterator[(Int, Int)]) => Iterator[(Int, Int)])
val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))
I suspect the problem is in that last line. The overload I'm trying to match with this call is:
[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[(Int, Int)], Iterator[B]) => Iterator[V])(implicit evidence$22: scala.reflect.ClassTag[B], implicit evidence$23: scala.reflect.ClassTag[V])org.apache.spark.rdd.RDD[V]
whereas what Scala actually sees is:
val RDD_combined = RDD1.zipPartitions(RDD1:org.apache.spark.rdd.RDD[(Int, Int)],merge(_:Iterator[(Int,Int)],_:Iterator[(Int,Int)]))
where the [B] type parameter has been instantiated to [(Int, Int)].
Any insight into how to make this work would be greatly appreciated!
Answer 0 (score: 2)
If you look at the signatures, you'll see that each variant is actually a method with multiple parameter lists, not a single parameter list with multiple arguments. The call you want looks more like:
RDD1.zipPartitions(RDD1)(merge)
(I'm not sure about the type ascriptions you've added in your original call?)
You may need a few other adjustments to get this fully working, but that's the fix for the error you're currently seeing.
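For completeness, here is a minimal end-to-end sketch of that call shape. It assumes a local SparkContext and a trivial merge body, both of which are illustrative rather than taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("zipPartitionsExample").setMaster("local[2]")
val sc = new SparkContext(conf)

// Illustrative merge: simply concatenates the two partition iterators
def merge(iter1: Iterator[(Int, Int)], iter2: Iterator[(Int, Int)]): Iterator[(Int, Int)] =
  iter1 ++ iter2

// zipPartitions requires both RDDs to have the same number of partitions
val rdd1 = sc.parallelize(Seq((1, 1), (2, 2)), 2)
val rdd2 = sc.parallelize(Seq((3, 3), (4, 4)), 2)

// merge goes in its own (second) parameter list, so no `_` is needed:
val combined = rdd1.zipPartitions(rdd2)(merge)
combined.collect().foreach(println)

Writing merge _ would also compile (it eta-expands the method explicitly), but it isn't needed here: because the function sits in its own parameter list with a known expected type, the compiler performs the eta expansion automatically.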