I am trying to generate an RDD from a List and another RDD, using Scala and Spark. The idea is to take a list of values and build an index containing, for each value, all the entries of the original dataset that match it.

Here is the code I am trying:
def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // filter function
  def hasVal(foo: String)(bar: Int): Boolean =
    foo.toInt == bar
  // call to sc.parallelize to ensure that an RDD is returned
  sc parallelize (
    foos map (_ match {
      case (s: String) => (s, bars filter hasVal(s))
    })
  )
}
Unfortunately, this does not compile in sbt:
> compile
[info] Compiling 1 Scala source to $TARGETDIR/target/scala-2.11/classes...
[error] $TARGETDIR/src/main/scala/wikipedia/WikipediaRanking.scala:56: type mismatch;
[error] found : List[(String, org.apache.spark.rdd.RDD[Int])]
[error] required: Seq[(String, Iterable[Int])]
[error] Error occurred in an application involving default arguments.
[error] foos map (_ match {
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 1 s, completed Mar 11, 2017 7:11:31 PM
I don't really understand the error I am getting. List is a subclass of Seq, and I assumed that RDD was a subclass of Iterable. Is there something obvious I have missed?
Answer 0 (score: 3)
Here is the solution as I understand it (it should use less memory than a Cartesian product):
def mcveInvertIndex(foos: List[String],
                    bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // filter function
  def hasVal(foo: String, bar: Int): Boolean =
    foo.toInt == bar
  // Producing RDD[(String, Iterable[Int])]
  (for {
    bar <- bars // it's important to have the RDD
                // in the first position of the for-comprehension
                // to produce the correct result type
    foo <- foos
    if hasVal(foo, bar)
  } yield (foo, bar)).groupByKey()
}
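As a quick sanity check of the answer above, here is a hedged usage sketch. It assumes a live SparkContext named `sc`; the input values are illustrative, not from the original question:

```scala
val foos = List("1", "2")
val bars = sc.parallelize(Seq(1, 1, 2, 3))

val index = mcveInvertIndex(foos, bars)
// collect() brings the result back to the driver;
// the expected grouping is:
//   "1" -> both occurrences of 1
//   "2" -> the single 2
// (3 matches no entry in foos, so it produces no key)
index.collect().foreach(println)
```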
Answer 1 (score: 1)
As mentioned in the comments, an RDD is not an Iterable, so you have to combine the two somehow and then aggregate them. This is my quick solution, although there might be a more efficient way:
def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  sc.makeRDD(foos)
    .cartesian(bars)
    .keyBy(x => x._1)
    .aggregateByKey(Iterable.empty[Int])(
      (agg: Iterable[Int], currVal: (String, Int)) => {
        if (currVal._1.toInt != currVal._2) agg
        else currVal._2 +: agg.toList
      },
      _ ++ _
    )
}
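Neither answer uses it, but a common Spark idiom for joining a small local collection against a large RDD is to broadcast the list and flatMap over the RDD, which avoids materializing the full cartesian pairing. A minimal sketch, assuming a SparkContext `sc` is in scope (the function name is mine, not from the answers):

```scala
def invertIndexBroadcast(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // ship the small list to every executor once
  val foosB = sc.broadcast(foos)
  bars
    // for each bar, emit a pair for every foo whose numeric value matches it
    .flatMap(bar => foosB.value.collect { case foo if foo.toInt == bar => (foo, bar) })
    .groupByKey()
}
```

Because each partition filters against the broadcast list locally, the shuffle only carries the matching pairs rather than every (foo, bar) combination.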