I have the following dataframe, say:
+----------+------------+------------+------------+---+
|       day|inout_amount|prev_balance|post_balance|row|
+----------+------------+------------+------------+---+
|2016-10-29|      -17000|       17000|           0|  1|
|2016-10-30|      -17000|       17000|           0|  2|
|2016-10-30|        5600|           0|        5600|  3|
|2016-10-30|        5600|        5600|       11200|  4|
|2016-10-30|        5800|       11200|       17000|  5|
+----------+------------+------------+------------+---+
The first row, for "2016-10-29", is correct, but the following four rows ("2016-10-30") are shuffled out of order. Here is the code that produces the table above:
case class transaction(
  day: String,
  inout_amount: Int,
  prev_balance: Int,
  post_balance: Int
)
val snippet = Seq(
  transaction("2016-10-29", -17000, 17000, 0),
  transaction("2016-10-30", -17000, 17000, 0),
  transaction("2016-10-30", 5600, 0, 5600),
  transaction("2016-10-30", 5600, 5600, 11200),
  transaction("2016-10-30", 5800, 11200, 17000)
)
// sqlContext is predefined in the spark-shell and in Zeppelin notebooks
val df = sqlContext.createDataFrame(snippet)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number // needed for row_number()
val window = Window.orderBy("day")
df.withColumn("row", row_number().over(window)).show
I now need to order the "2016-10-30" rows by the rule that a transaction's "prev_balance" must equal the previous transaction's "post_balance". That is, the desired dataframe should look like this:
+----------+------------+------------+------------+---+-------+
|       day|inout_amount|prev_balance|post_balance|row|order-1|
+----------+------------+------------+------------+---+-------+
|2016-10-29|      -17000|       17000|           0|  1|      0|
|2016-10-30|      -17000|       17000|           0|  2|      4|
|2016-10-30|        5600|           0|        5600|  3|      1|
|2016-10-30|        5600|        5600|       11200|  4|      2|
|2016-10-30|        5800|       11200|       17000|  5|      3|
+----------+------------+------------+------------+---+-------+
I am new to this and my guess is that I need to create a "udf" and then apply it with "withColumn"... please help!
Answer 0 (score: 0)
You need to groupBy the date, sort the values within each group according to your criterion, and flatMap the result back out. This can easily be done with RDDs.
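
For illustration, here is a minimal sketch of that idea in Scala (not part of the original answer). It makes a few assumptions the question does not state: prev_balance values are unique within a day, each day opens with the previous day's closing balance, and the very first opening balance is 17000 (read off the sample data). Because each day's opening balance depends on the previous day, the sketch folds over the (small) per-day groups sequentially on the driver rather than using a distributed flatMap; chainSort is a hypothetical helper, not a Spark API.

import scala.annotation.tailrec

// Chain-sort one day's transactions: starting from the opening balance,
// repeatedly pick the transaction whose prev_balance equals the running
// balance. Returns the transactions in chain order plus the closing balance.
// ASSUMPTION: prev_balance is unique within a day; byPrev(balance) throws
// if the chain is broken.
def chainSort(opening: Int, txs: Seq[transaction]): (Seq[transaction], Int) = {
  val byPrev = txs.map(t => t.prev_balance -> t).toMap
  @tailrec
  def loop(balance: Int, acc: List[transaction]): (List[transaction], Int) =
    if (acc.size == txs.size) (acc.reverse, balance)
    else {
      val next = byPrev(balance)
      loop(next.post_balance, next :: acc)
    }
  loop(opening, Nil)
}

// Group the rows by day and pull the (assumed small) groups to the driver.
val byDay = df.rdd
  .map(r => transaction(
    r.getAs[String]("day"),
    r.getAs[Int]("inout_amount"),
    r.getAs[Int]("prev_balance"),
    r.getAs[Int]("post_balance")))
  .groupBy(_.day)
  .collect()
  .sortBy(_._1)

// Fold over the days in order, carrying each day's closing balance forward.
// ASSUMPTION: the initial opening balance, 17000, is taken from the sample.
val (ordered, _) = byDay.foldLeft((Vector.empty[transaction], 17000)) {
  case ((acc, opening), (_, txs)) =>
    val (sorted, closing) = chainSort(opening, txs.toSeq)
    (acc ++ sorted, closing)
}

// Rebuild a dataframe; the zipWithIndex position plays the role of "order-1".
val result = sqlContext
  .createDataFrame(ordered.zipWithIndex.map { case (t, i) =>
    (t.day, t.inout_amount, t.prev_balance, t.post_balance, i)
  })
  .toDF("day", "inout_amount", "prev_balance", "post_balance", "order-1")

result.show()

Run against the snippet above, this should reproduce the desired table: the 2016-10-29 row gets 0 and the 2016-10-30 rows get 4, 1, 2, 3.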