How to compare records in a DataFrame in Scala

Asked: 2016-08-11 07:25:09

Tags: scala apache-spark spark-dataframe

For example, I have a DataFrame like this:

val tmp_df = sqlContext.createDataFrame(Seq(
  ("One", "Sagar", 1),
  ("Two", "Ramesh", 2),
  ("Three", "Suresh", 3),
  ("One", "Sagar", 5)
)).toDF("ID", "Name", "Balance")

Now I want to write all records of the above DataFrame that share the same ID to a single file. Please advise.

1 Answer:

Answer 0: (score: 0)

// Needed for the 'symbol column syntax used below
import sqlContext.implicits._

// Find the IDs that occur more than once and rename the id column to "idstowrite"
// (Spark resolves column names case-insensitively by default, so 'id matches "ID")
val idsMoreThanOne = tmp_df.groupBy('id).count.filter('count.gt(1)).withColumnRenamed("id", "idstowrite")
idsMoreThanOne.show
// Join back with the original DataFrame to recover the full rows for those IDs
val joinedDf = idsMoreThanOne.join(tmp_df, tmp_df("id") === idsMoreThanOne("idstowrite"), "left")
joinedDf.show
// Select only the columns we want
val dfToWrite = joinedDf.select("id", "Name", "Balance")
dfToWrite.show

The resulting DataFrame, dfToWrite, contains only the two rows whose ID is "One".
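
To sanity-check the selection logic without a Spark cluster, here is a minimal sketch of the same idea on plain Scala collections. It is an analogue, not Spark API: the `Record` case class and the `SharedIds` object are hypothetical names introduced only for illustration, and the sample rows are the ones from the question.

```scala
// Plain-Scala analogue of the DataFrame logic above; no Spark required.
// `Record` and `SharedIds` are hypothetical names used only for illustration.
case class Record(id: String, name: String, balance: Int)

object SharedIds {
  // Keep only the records whose id occurs more than once, preserving input order.
  def recordsWithSharedId(records: Seq[Record]): Seq[Record] = {
    // Counterpart of groupBy(id).count.filter(count > 1): collect the repeated ids
    val repeatedIds: Set[String] = records
      .groupBy(_.id)
      .collect { case (id, group) if group.size > 1 => id }
      .toSet
    // Counterpart of joining the repeated ids back to the original rows
    records.filter(r => repeatedIds.contains(r.id))
  }

  // The rows from the question
  val sample: Seq[Record] = Seq(
    Record("One", "Sagar", 1),
    Record("Two", "Ramesh", 2),
    Record("Three", "Suresh", 3),
    Record("One", "Sagar", 5)
  )
}
```

Calling `SharedIds.recordsWithSharedId(SharedIds.sample)` returns the two records with id "One", in their original order, which matches what the join-based answer selects.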