I'm working with a master list DataFrame and then merging and deleting data against it based on a change list DataFrame.
I'm very new to Scala, so I have no doubt I'm not doing things in the most efficient way!
Basically I'm just trying to work out how the final DF will perform, since I already end up with several DataFrames stacked on top of one another.
The record count is 40 million+.
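To make the problem concrete, here's a hypothetical mini-example of the two inputs (made-up rows and values; the real inputs are the JSON/CSV loads in the code below, and the real column list is elided there):

import spark.implicits._ // assumes a SparkSession named `spark`, as in the code below

// Master: the current full list, keyed by ttm_id
val masterSample = Seq(
  (1L, 10L, 100L),
  (2L, 20L, 200L)
).toDF("ttm_id", "sequence_no", "audit_id")

// ChangeSet: one row per change, tagged INSERT / UPDATE / DELETE
val changeSample = Seq(
  (2L, 21L, 201L, "UPDATE"), // replaces the old version of ttm_id 2
  (3L, 30L, 300L, "INSERT"), // brand-new record
  (1L, 10L, 100L, "DELETE")  // removes ttm_id 1 from the master
).toDF("ttm_id", "sequence_no", "audit_id", "__change__")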
import org.apache.spark.sql.SparkSession

object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Load the change list (JSON) and the master list (CSV) into DataFrames
    val ChangeSet = spark.read.json("/user/data/PythonAPITest/ChangeList.json")
    val Master = spark.read.format("csv").option("header", "true").load("/user/data/PythonAPITest/master-landonline-title-memorial.csv")

    // Deleted = anti-joining Master -> ChangeSet: drops Master records whose ttm_id has a DELETE in ChangeSet
    val Deleted = Master.join(ChangeSet.where($"__change__" === "DELETE"), Seq("ttm_id"), "left_anti")

    /* Create separate lists. Lists indicate which records need to be added.
       Filter on __change__ before the select so the column is still available. */
    val InsertList = ChangeSet.filter($"__change__" === "INSERT").select("LIST OF COLUMNS")
    val UpdateList = ChangeSet.filter($"__change__" === "UPDATE").select("LIST OF COLUMNS")

    // UpdatedDeleted = anti-joining Deleted -> UpdateList: drops the old versions of updated records
    val UpdatedDeleted = Deleted.join(UpdateList, Seq("ttm_id", "sequence_no"), "left_anti")
    // Drop any remaining overlap on (ttm_id, audit_id), then append the new versions
    val Updated = UpdatedDeleted.join(UpdateList, Seq("ttm_id", "audit_id"), "left_anti").union(UpdateList)
    // Keep only genuinely new records, then append everything processed so far
    val Inserted = InsertList.join(Updated, Seq("ttm_id", "audit_id"), "left_anti").union(Updated)
  }
}
This will be spark-submitted and the result stored on HDFS, so if there's a better way of doing things, or a good place to look for tips & tricks, please let me know.
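For reference, this is roughly how I plan to persist and write the final frame once it's built (a sketch only; the output path and Parquet format are placeholders I haven't settled on):

import org.apache.spark.storage.StorageLevel

// Cache the final frame so the write (and any later validation counts)
// doesn't recompute the whole chain of anti-joins and unions
Inserted.persist(StorageLevel.MEMORY_AND_DISK)

// Inspect the physical plan to see how the stacked joins actually resolve
Inserted.explain()

// Write the merged result back to HDFS (placeholder path/format)
Inserted.write.mode("overwrite").parquet("/user/data/PythonAPITest/output")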