Spark: how to efficiently merge shuffled RDDs?

Date: 2016-09-29 05:11:49

Tags: apache-spark rdd

I have 5 shuffled key-value RDDs, one big one (1,000,000 records) and 4 relatively small ones (100,000 records each). All the RDDs are shuffled with the same number of partitions. I have two strategies to merge the 5 of them:

  1. Merge all 5 RDDs together
  2. Merge the 4 small RDDs together, then join with the big one

I thought strategy 2 would be more efficient, since it does not re-shuffle the big one. But the experimental results show that strategy 1 is more efficient. The code and output follow:

    Code

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkContext, SparkConf}
    
    
    object MergeStrategy extends App {
    
        Logger.getLogger("org").setLevel(Level.ERROR)
        Logger.getLogger("akka").setLevel(Level.ERROR)
    
        val conf = new SparkConf().setMaster("local[4]").setAppName("test")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
    
        val bigRddSize = 1e6.toInt
        val smallRddSize = 1e5.toInt
        println(bigRddSize)
    
        val bigRdd = sc.parallelize((0 until bigRddSize)
            .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
        bigRdd.take(10).foreach(println)
    
        val smallRddList = (0 until 4).map(i => {
            val rst = sc.parallelize((0 until smallRddSize)
                .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
            println(rst.count)
            rst
        }).toArray
    
        // strategy 1
        {
            val begin = System.currentTimeMillis
    
            val s1Rst = sc.union(Array(bigRdd) ++ smallRddList).distinct(100)
            println(s1Rst.count)
    
            val end = System.currentTimeMillis
            val timeCost = (end - begin) / 1000d
            println("S1 time count: %.1f s".format(timeCost))
        }
    
        // strategy 2
        {
            val begin = System.currentTimeMillis
    
            val smallMerged = sc.union(smallRddList).distinct(100).cache
            println(smallMerged.count)
    
            val s2Rst = bigRdd.fullOuterJoin(smallMerged).flatMap({ case (key, (left, right)) => {
                if (left.isDefined && right.isDefined) Array((key, left.get), (key, right.get)).distinct
                else if (left.isDefined) Array((key, left.get))
                else if (right.isDefined) Array((key, right.get))
                else throw new Exception("Cannot happen")
            }
            })
            println(s2Rst.count)
    
            val end = System.currentTimeMillis
            val timeCost = (end - begin) / 1000d
            println("S2 time count: %.1f s".format(timeCost))
        }
    
    }
    

    Output

    1000000
    (688282474,0)
    (-255073127,0)
    (872746474,0)
    (-792516900,0)
    (417252803,0)
    (-1514224305,0)
    (1586932811,0)
    (1400718248,0)
    (939155130,0)
    (1475156418,0)
    100000
    100000
    100000
    100000
    1399777
    S1 time count: 39.7 s
    399984
    1399894
    S2 time count: 49.8 s
    

    Is my understanding of shuffled RDDs wrong? Can anyone give some advice? Thanks!
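The intuition behind strategy 2 can be checked with a back-of-envelope shuffle-cost model. The sketch below is plain Scala (no Spark); the record counts are taken from the experiment above, and the model is an assumption of mine, counting only how many records each strategy moves in shuffles: strategy 1 shuffles everything exactly once for the `distinct`, while strategy 2 pays for a `distinct` over the small union *and* for a `fullOuterJoin` that shuffles both sides again.

```scala
object ShuffleCostSketch {
  // Record counts from the experiment above.
  val big = 1000000      // the one large RDD
  val small = 4 * 100000 // the four small RDDs combined

  // Strategy 1: union all five RDDs, then one distinct.
  // A single shuffle stage moves every record once.
  val s1 = big + small

  // Strategy 2: distinct over the small union (one shuffle of 400k records),
  // then a fullOuterJoin that shuffles both the big RDD and the merged small
  // RDD again. Duplicates removed by the first distinct make this an upper bound.
  val s2 = small + (big + small)

  def main(args: Array[String]): Unit =
    println(s"strategy 1 moves ~$s1 records, strategy 2 moves ~$s2 records")
}
```

On this model strategy 2 moves roughly 1.8M records against 1.4M for strategy 1, which is at least consistent with the measured timings. The model deliberately ignores co-partitioning, which the answer below exploits.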

1 answer:

Answer 0: (score: 0)

I found a way to merge RDDs more efficiently; compare the following two merge strategies:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{HashPartitioner, SparkContext, SparkConf}
import scala.collection.mutable.ArrayBuffer

object MergeStrategy extends App {

    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val conf = new SparkConf().setMaster("local[4]").setAppName("test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val rddCount = 20
    val mergeCount = 5
    val dataSize = 20000
    val parts = 50

    // generate data
    scala.util.Random.setSeed(943343)
    val testData = for (i <- 0 until rddCount)
        yield sc.parallelize(scala.util.Random.shuffle((0 until dataSize).toList).map(x => (x, 0)))
            .partitionBy(new HashPartitioner(parts))
            .cache
    testData.foreach(x => println(x.count))

    // strategy 1: merge directly
    {
        val buff = ArrayBuffer[RDD[(Int, Int)]]()
        val begin = System.currentTimeMillis
        for (i <- 0 until rddCount) {
            buff += testData(i)
            if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
                val merged = sc.union(buff).distinct
                    .partitionBy(new HashPartitioner(parts)).cache
                println(merged.count)

                buff.foreach(_.unpersist(false))
                buff.clear
                buff += merged
            }
        }
        val end = System.currentTimeMillis
        val timeCost = (end - begin) / 1000d
        println("Strategy 1 Time Cost: %.1f".format(timeCost))
        assert(buff.size == 1)

        println("Strategy 1 Complete, with merged Count %s".format(buff(0).count))
    }


    // strategy 2: merge directly without repartition
    {
        val buff = ArrayBuffer[RDD[(Int, Int)]]()
        val begin = System.currentTimeMillis
        for (i <- 0 until rddCount) {
            buff += testData(i)
            if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
                val merged = sc.union(buff).distinct(parts).cache
                println(merged.count)

                buff.foreach(_.unpersist(false))
                buff.clear
                buff += merged
            }
        }
        val end = System.currentTimeMillis
        val timeCost = (end - begin) / 1000d
        println("Strategy 2 Time Cost: %.1f".format(timeCost))
        assert(buff.size == 1)

        println("Strategy 2 Complete, with merged Count %s".format(buff(0).count))
    }

}

The results show that strategy 1 (time cost 20.8 s) is more efficient than strategy 2 (time cost 34.3 s). My machine runs Windows 8 with a 4-core 2.0 GHz CPU and 8 GB of memory.

The only difference is that strategy 1 partitions the merged result with a HashPartitioner, while strategy 2 does not. So strategy 1 produces a ShuffledRDD, while strategy 2 produces a MapPartitionsRDD. I think the RDD.distinct function handles a ShuffledRDD more efficiently than a MapPartitionsRDD.
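The underlying effect can be illustrated without Spark: when every input is partitioned by the same partitioner, all copies of a key live at the same partition index, so a union of co-partitioned datasets can be deduplicated partition-by-partition with no data movement. The sketch below is a minimal plain-Scala simulation of that idea; `partOf` is a stand-in for `HashPartitioner.getPartition`, and the tiny datasets are made up for illustration.

```scala
object CoPartitionedDedup {
  val parts = 4

  // Stand-in for HashPartitioner.getPartition: non-negative modulo of hashCode.
  def partOf(k: Int): Int = ((k.hashCode % parts) + parts) % parts

  // "Partition" a dataset the way partitionBy(new HashPartitioner(parts)) would.
  def partition(data: Seq[Int]): Map[Int, Seq[Int]] = data.groupBy(partOf)

  // Two co-partitioned "RDDs" with overlapping keys (4 and 5 appear in both).
  val a = partition(Seq(1, 2, 3, 4, 5))
  val b = partition(Seq(4, 5, 6, 7))

  // Duplicates of a key can only meet inside the same partition index,
  // so distinct reduces to a local per-partition dedup: no shuffle needed.
  val merged: Map[Int, Set[Int]] =
    (0 until parts).map { p =>
      p -> (a.getOrElse(p, Nil) ++ b.getOrElse(p, Nil)).toSet
    }.toMap

  val distinctCount: Int = merged.values.map(_.size).sum

  def main(args: Array[String]): Unit =
    println(s"distinct keys: $distinctCount") // 7 unique keys across both inputs
}
```

Spark's `distinct` builds on `reduceByKey`, which can skip the shuffle step when the input already carries a matching partitioner; that is the advantage the HashPartitioner in strategy 1 preserves for the next merge round.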