Iterating over a cogrouped RDD

Date: 2015-10-29 18:47:59

Tags: scala apache-spark

I used the cogroup function and got the following RDD:

org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
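
For reference, this is the shape you get from cogrouping two pair RDDs keyed by Int. A minimal sketch of how such an RDD could be built (sc is the SparkContext; the sample values are copied from the output shown below):

import org.apache.spark.rdd.RDD

// two pair RDDs keyed by Int, each carrying (Int, Long) values
val data1: RDD[(Int, (Int, Long))] = sc.parallelize(Seq(
  (-2095842000, (1504999740, 1430096464017L)),
  (352234000,   (1350763226, 1430006650167L))))
val data2: RDD[(Int, (Int, Long))] = sc.parallelize(Seq(
  (-2095842000, (-511732877, 1428682217564L)),
  (352234000,   (2113207721, 1428994842593L))))

// cogroup gathers the values from both RDDs for each key
val cogrouped: RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))] =
  data1.cogroup(data2)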

Before the map operation, the joined object looks like this:

RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]

(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))

Now I want to do the following:

import scala.collection.mutable.ListBuffer

val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  for(tupleB <- listB) {
    val localResults = ListBuffer[Double]()
    val itemToTest = Set(tupleB._1)
    val tempList = ListBuffer[(Int, Double)]()
    for(tupleA <- listA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      val i = (tupleA._1, tValue)
      tempList += i
    }
    val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
    val intersect = sortList.toSet.intersect(itemToTest)
    if (intersect.size > 0)
      localResults += 1.0
    else localResults += 0.0
    val normalized = sum(localResults.toList)/localResults.size
    globalBuffer += normalized
  }
})

// method sum
def sum(xs: List[Double]): Double = xs.sum // do the sum

In the end, I expect joined to be a collection of Double values, but when I look at it, it is Unit. I also realize this is not the Scala way of doing it. How can I get globalBuffer as the final result?

2 answers:

Answer 0 (score: 1)

Well, if I understand your code correctly, it could benefit from these improvements:

val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  val localResults = listB.map {
    case (intBValue, longBValue) =>
      val itemToTest = intBValue // it's always one element
      val tempList = listA.map {
        case (intAValue, longAValue) =>
          (intAValue, someFunctionReturnDouble(longBValue, longAValue))
      }
      val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1)
      // no real need to convert to a set for 20 elements, by the way
      if (sortList.toSet.contains(itemToTest)) 1.0 else 0.0
  }
  sum(localResults) / localResults.size
})
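
With this version, joined is an RDD[Double] holding one proportion per cogrouped key, and nothing is mutated on the driver. A quick way to inspect it (a sketch; collecting is fine here only because there is a single Double per key):

val scores: Array[Double] = joined.collect() // bring the per-key proportions to the driver
scores.foreach(println)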

Answer 1 (score: 1)

Transformations of RDDs do not modify globalBuffer. A copy of globalBuffer is made and sent to each worker, but any modifications to those copies on the workers will never modify the globalBuffer that exists on the driver (the one you defined outside the map over the RDD). Here is what I would do (with a few additional modifications):

val joined = data1.cogroup(data2) map { x =>
  val iterA = x._2._1
  val iterB = x._2._2
  var count, positiveCount = 0
  val tempList = ListBuffer[(Int, Double)]()
  for (tupleB <- iterB) {
    tempList.clear
    for(tupleA <- iterA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      tempList += ((tupleA._1, tValue))
    }
    val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
    if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
    count += 1
  }
  positiveCount.toDouble/count
}
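
As a small side illustration of the closure-copy behavior described above (a sketch, not part of the original answer): mutating a driver-side buffer inside map only touches the workers' copies, so the driver-side buffer stays empty.

import scala.collection.mutable.ListBuffer

val buf = ListBuffer[Double]()
sc.parallelize(1 to 3).map { i => buf += i.toDouble; i }.count() // action to force execution
println(buf.isEmpty) // true on the driver: each task mutated its own copy of buf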

At this point, you can use joined.collect to get a local copy of the proportions.
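
For example (a sketch; the averaging step is just one assumed way to aggregate the collected proportions):

val proportions: Array[Double] = joined.collect()  // local copy on the driver
val globalResult = proportions.toList              // plays the role of the old globalBuffer
val overallMean  = proportions.sum / proportions.length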