Best performance when working with multiple DataFrames

Time: 2018-02-01 02:25:01

Tags: scala dataframe spark-dataframe

I'm working with a master list DataFrame, and then merging data into and removing data from that master frame based on a change list.

I'm very new to Scala, so I have no doubt I'm not doing things in the most efficient way!

Basically I'm just trying to work out how I can improve the performance of producing the final DF, since I already have several DataFrames joined and stacked on top of each other. The record count is 40 million+.

 object SparkPi {
        def main(args: Array[String]) {
                val conf = new SparkConf()
                val sc = new SparkContext(conf)
                val sqlContext = new SQLContext(sc)
                import sqlContext.implicits._  // needed for the $"col" syntax

                // Load the change list (JSON) and the master list (CSV) into DataFrames
                val ChangeSet = sqlContext.read.json("/user/data/PythonAPITest/ChangeList.json")
                val Master = sqlContext.read.format("csv").option("header", "true").load("/user/data/PythonAPITest/master-landonline-title-memorial.csv")

                // Deleted = left-anti join Master -> ChangeSet. Removes records from Master whose ttm_id appears in the DELETE changes
                val Deleted = Master.join(ChangeSet.where($"__change__" === "DELETE"), Seq("ttm_id"), "left_anti")

                /* Create separate lists. Lists indicate which records need to be added */
                val InsertList = ChangeSet.select("LIST OF COLUMNS").filter($"__change__" === "INSERT")
                val UpdateList = ChangeSet.select("LIST OF COLUMNS").filter($"__change__" === "UPDATE")


                // UpdatedDeleted = left-anti join Deleted -> UpdateList. Removes the old versions of updated records
                // (UpdateList is already filtered to UPDATE rows, so no extra where() is needed)
                val UpdatedDeleted = Deleted.join(UpdateList, Seq("ttm_id", "sequence_no"), "left_anti")
                val Updated = UpdatedDeleted.join(UpdateList, Seq("ttm_id", "audit_id"), "left_anti").union(UpdateList)
                val Inserted = InsertList.join(Updated, Seq("ttm_id", "audit_id"), "left_anti").union(Updated)

        }
}
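To make the intent of the chained joins clearer, here is a minimal sketch of the same "left-anti join, then union" pattern using plain Scala collections instead of Spark (the `Row` case class, `ttmId` key, and sample data are illustrative assumptions, not the real schema). Each step drops the rows whose key appears in the change set and then appends the replacement rows, which is exactly what the `left_anti` joins followed by `union` do above:

```scala
// Illustrative only: plain-collections version of the left_anti + union pattern.
// Row and ttmId are hypothetical stand-ins for the real CSV schema.
case class Row(ttmId: Int, value: String)

object ChangeApply {
  // "Left-anti join" on ttmId: keep rows of master whose key is NOT in changes
  def leftAnti(master: Seq[Row], changes: Seq[Row]): Seq[Row] = {
    val changedKeys = changes.map(_.ttmId).toSet
    master.filterNot(r => changedKeys.contains(r.ttmId))
  }

  // Drop old versions of updated rows, then append the new versions
  // (equivalent to df.join(updates, key, "left_anti").union(updates))
  def applyUpdates(master: Seq[Row], updates: Seq[Row]): Seq[Row] =
    leftAnti(master, updates) ++ updates

  def main(args: Array[String]): Unit = {
    val master  = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))
    val updates = Seq(Row(2, "B"))
    println(applyUpdates(master, updates))
  }
}
```

In Spark the same semantics apply per join key; the difference is only that each `left_anti`/`union` stage adds a shuffle over the 40M-row DataFrame.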

This will be submitted to Spark and stored on HDFS, so if there's a better way of doing things, or a good place to look for tips & tricks, please let me know.

0 Answers:

There are no answers yet.