How to filter an RDD based on a function of another RDD in Spark?

Asked: 2014-09-25 09:52:47

Tags: scala map apache-spark rdd

I am a beginner with Apache Spark. I want to filter out all groups whose sum of weights is greater than a constant value in an RDD. The "weight" map is also an RDD. Here is a small demo; the groups to be filtered are stored in "groups", and the constant value is 12:

val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
val wm = weights.toArray.toMap
def isheavy(inp: String): Boolean = {
  val allw = inp.split(",").map(wm(_)).sum
  allw > 12
}
val result = groups.filter(isheavy)

When the input data is very large, say >10GB for example, I always run into a "java heap out of memory" error. I suspect it is caused by "weights.toArray.toMap", because it converts the distributed RDD into Java objects in the JVM. So I tried to filter with the RDD directly:

val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
def isheavy(inp: String): Boolean = {
  val items = inp.split(",")
  val wm = items.map(x => weights.filter(_._1 == x).first._2)
  wm.sum > 12
}
val result = groups.filter(isheavy)

After loading this script into the spark shell and running result.collect, I get a "java.lang.NullPointerException" error. Someone told me that a NullPointerException is raised when an RDD is manipulated inside another RDD, and suggested I put the weights into Redis.

So how can I get "result" without converting "weights" to a Map or putting it into Redis? Is there a solution to filter one RDD based on another map-like RDD without the help of an external datastore service? Thanks!

2 Answers:

Answer 0 (score: 4):

This assumes your groups are unique; otherwise, make them unique first via distinct, etc. If either the groups or the weights are small, it should be easy. If both the groups and the weights are huge, you can try the following, which may be more scalable but also looks complicated.
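For the small case, here is a minimal sketch, assuming the weights RDD is small enough to collect into driver memory: collect it into a local Map, broadcast it, and do the lookup inside the filter closure, so no RDD is referenced inside another RDD:

val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
// collectAsMap pulls all pairs to the driver; only safe when weights is small
val wm = sc.broadcast(weights.collectAsMap())
// each executor reads the broadcast map locally, so there is no nested RDD access
val result = groups.filter(s => s.split(",").map(w => wm.value(w)).sum > 12)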

val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
//map groups to be (a, (a,b,c,d)), (b, (a,b,c,d)), (c, (a,b,c,d)) ...
val g1=groups.flatMap(s=>s.split(",").map(x=>(x, Seq(s))))
//j will be (a, ((a,b,c,d),3)...
val j = g1.join(weights)
//k will be ((a,b,c,d), 3), ((a,b,c,d),2) ...
val k = j.map(x=>(x._2._1, x._2._2))
//l will be ((a,b,c,d), (3,2,5,1))...
val l = k.groupByKey()
//filter by the sum of the weights in the 2nd element
val m = l.filter(x => x._2.sum > 12)
//we only need the original list
val result=m.map(x=>x._1)
//don't do this in a real product, otherwise all results go to the driver; use saveAsTextFile, etc. instead
scala> result.foreach(println)
List(e,g)
List(b,c,e)
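Because everything above is expressed as RDD transformations (flatMap, join, groupByKey, filter), both datasets stay distributed and no RDD is referenced inside another RDD's closure. For a real job, write the result out instead of printing it, for example (the output path below is just a placeholder):

result.saveAsTextFile("hdfs:///tmp/heavy-groups")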

Answer 1 (score: 2):

The "java out of memory" error occurs because Spark uses its spark.default.parallelism property when determining the number of splits, which by default is the number of cores available:

// From CoarseGrainedSchedulerBackend.scala

override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

When the input becomes large and memory is limited, you should increase the number of splits.

You can do something like this:

 val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g") 
 val splitSize = 10000 // specify some number of elements that fit in memory.

 val numSplits = (input.size / splitSize) + 1 // has to be > 0.
 val groups = sc.parallelize(input, numSplits) // specify the # of splits.

 val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap

 def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
 val result = groups.filter(isHeavy)

You can also consider increasing the executor memory size using spark.executor.memory.
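For example, a minimal sketch of setting it programmatically when you build your own SparkContext (the app name, memory, and parallelism values below are only illustrations; tune them to your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// values are examples only; adjust to your cluster
val conf = new SparkConf()
  .setAppName("heavy-groups")
  .set("spark.executor.memory", "4g")
  .set("spark.default.parallelism", "64")
val sc = new SparkContext(conf)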