I have a DataFrame. For each id, I need the latest record in the table based on updateTableTimestamp.
df.show()
+--------------------+-----+-----+--------------------+
| Description| Name| id |updateTableTimestamp|
+--------------------+-----+-----+--------------------+
| | 042F|64185| 1507306990753|
| | 042F|64185| 1507306990759|
|Testing |042MF| 941| 1507306990753|
| | 058F| 8770| 1507306990753|
|Testing 3 |083MF|31663| 1507306990759|
|Testing 2 |083MF|31663| 1507306990753|
+--------------------+-----+-----+--------------------+
Required output:
+--------------------+-----+-----+--------------------+
| Description| Name| id |updateTableTimestamp|
+--------------------+-----+-----+--------------------+
| | 042F|64185| 1507306990759|
|Testing |042MF| 941| 1507306990753|
| | 058F| 8770| 1507306990753|
|Testing 3 |083MF|31663| 1507306990759|
+--------------------+-----+-----+--------------------+
I tried:
sqlContext.sql("SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY Id ORDER BY updateTableTimestamp DESC) rank from temptable) tmp where rank = 1")
It errors on the PARTITION BY clause:

Exception in thread "main" java.lang.RuntimeException: [1.29] failure: ``union'' expected but `(' found

I am using Spark 1.6.2.
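A note on the error: in Spark 1.6, window functions such as row_number() OVER (...) are only supported through a HiveContext; a plain SQLContext typically fails to parse the OVER clause with exactly this kind of ``union'' expected message. Separately, the reduction being asked for here (one row per id, carrying the latest updateTableTimestamp) can be sketched in plain Scala on the sample data above, with no Spark dependency; the Row case class and values below just mirror the df.show() output:

```scala
// Plain-Scala sketch (no Spark) of the desired reduction:
// within each id group, keep the row with the largest updateTableTimestamp.
case class Row(description: String, name: String, id: Int, updateTableTimestamp: Long)

val rows = Seq(
  Row("",          "042F",  64185, 1507306990753L),
  Row("",          "042F",  64185, 1507306990759L),
  Row("Testing",   "042MF", 941,   1507306990753L),
  Row("",          "058F",  8770,  1507306990753L),
  Row("Testing 3", "083MF", 31663, 1507306990759L),
  Row("Testing 2", "083MF", 31663, 1507306990753L)
)

// group by id, then take the row with the max timestamp in each group
val latest = rows.groupBy(_.id).values.map(_.maxBy(_.updateTableTimestamp)).toSeq
```

The Spark answers below implement this same group-by-id, max-per-group semantics on a DataFrame.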
Answer 0 (score: 0)
select Description, Name, id, updateTableTimestamp
from table_name
where id in (select id from table_name group by updateTableTimestamp)
order by updateTableTimestamp desc;
Answer 1 (score: 0)
import org.apache.spark.sql.functions.first
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.functions.col
val dfOrder = df.orderBy(col("id"), col("updateTableTimestamp").desc)
val dfMax = dfOrder.groupBy(col("id")).
  agg(first("description").as("description"),
      first("name").as("name"),
      first("updateTableTimestamp").as("updateTableTimestamp"))
dfMax.show
After that, if you want to reorder the fields, just apply the select function to the new DataFrame.
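One caveat with the answer above: relying on first after an orderBy followed by groupBy is not a strictly documented ordering guarantee in Spark. A more robust pattern is to compute the max timestamp per id and join it back. Here is a plain-Scala sketch of that semantics (hypothetical Rec type, no Spark dependency), using a subset of the sample rows:

```scala
// "Max per key, then join back" pattern, sketched on plain collections.
case class Rec(description: String, name: String, id: Int, updateTableTimestamp: Long)

val recs = Seq(
  Rec("Testing 3", "083MF", 31663, 1507306990759L),
  Rec("Testing 2", "083MF", 31663, 1507306990753L),
  Rec("",          "042F",  64185, 1507306990753L),
  Rec("",          "042F",  64185, 1507306990759L)
)

// step 1: max timestamp per id
// (in Spark: df.groupBy("id").agg(max("updateTableTimestamp")))
val maxTs: Map[Int, Long] =
  recs.groupBy(_.id).map { case (id, rs) => id -> rs.map(_.updateTableTimestamp).max }

// step 2: keep rows whose timestamp equals the per-id max
// (in Spark: an inner join back to df on both id and updateTableTimestamp)
val latest = recs.filter(r => r.updateTableTimestamp == maxTs(r.id))
```

In Spark the same idea would be df.groupBy("id").agg(max("updateTableTimestamp").as("updateTableTimestamp")) joined back to df on id and updateTableTimestamp; unlike the first-based approach, it does not depend on row order within groups.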