I have a dataset and I need to compute the consecutiveness of the data when it satisfies particular statuses. A sample dataset is below. The use case: if an exchange ID has a RISKY or UNSTABLE status on consecutive dates, increase the count by 1 and merge it back into the dataset. I am trying to do this with Spark.
Date       Exchange Id  Status    Consecutiveness
5/05/2017  a            RISKY     0
5/05/2017  b            Stable    0
5/05/2017  c            Stable    0
5/05/2017  d            UNSTABLE  0
5/05/2017  e            UNKNOWN   0
5/05/2017  f            UNKNOWN   0
6/05/2017  a            RISKY     1
6/05/2017  b            Stable    0
6/05/2017  c            Stable    0
6/05/2017  d            UNSTABLE  1
6/05/2017  e            UNSTABLE  1
6/05/2017  f            UNKNOWN   0
My approach is as follows.
I am trying the commands below, but I am running into problems and cannot get past steps 3, 4 and 5 (the join and the update).
case class Telecom(Date: String, Exchange: String, Stability: String, Consecutive: Int)

// Load the CSV and map each row onto the case class
// (four columns, so the count is at index 3, not 4)
val emp1 = sc.textFile("file:///Filename")
  .map(_.split(","))
  .map(e => Telecom(e(0), e(1), e(2), e(3).trim.toInt))
  .toDF()
emp1.registerTempTable("T1") // register before querying T1

val PreviousWeek = sqlContext.sql("select * from T1 limit 10")
val FailPreviousWeek = sqlContext.sql("select Exchange, Consecutive from T1 where Date = '5/05/2017' and Stability in ('RISKY','UNSTABLE')")
val FailCurrentWeek = sqlContext.sql("select Exchange, Consecutive from T1 where Date = '6/05/2017' and Stability in ('RISKY','UNSTABLE')")

// This is where I get stuck: joining the two weeks, incrementing the count,
// and merging the result back into the full dataset
FailCurrentWeek.join(FailPreviousWeek, FailCurrentWeek("Exchange") === FailPreviousWeek("Exchange"))
val UpdateCurrentWeek = FailCurrentWeek.select($"Exchange".alias("ExchangeId"), $"Consecutive" + 1)
val UpdateDataSet = emp1.join(UpdateCurrentWeek)
Answer 0 (score: 0)
This is a perfect use case for my beloved windowed aggregate functions.

I think lag (combined with when) will do it:

lag(columnName: String, offset: Int): Column — returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lower, when}

// One window per exchange, ordered by date
val exchIds = Window.partitionBy("Exchange_Id").orderBy("Date")

// 1 when the current status is RISKY/UNSTABLE and repeats the previous row's status, else 0
val cc = when(lower($"Status") === "risky" && $"lag" === $"Status", 1).
  when(lower($"Status") === "unstable" && $"lag" === $"Status", 1).
  otherwise(0)

val solution = input.
  withColumn("lag", lag("Status", 1) over exchIds).
  withColumn("Consecutiveness", cc).
  orderBy("Date", "Exchange_Id").
  select("Date", "Exchange_Id", "Status", "Consecutiveness")
scala> solution.show
+---------+-----------+--------+---------------+
| Date|Exchange_Id| Status|Consecutiveness|
+---------+-----------+--------+---------------+
|5/05/2017| a| RISKY| 0|
|5/05/2017| b| Stable| 0|
|5/05/2017| c| Stable| 0|
|5/05/2017| d|UNSTABLE| 0|
|5/05/2017| e| UNKNOWN| 0|
|5/05/2017| f| UNKNOWN| 0|
|6/05/2017| a| RISKY| 1|
|6/05/2017| b| Stable| 0|
|6/05/2017| c| Stable| 0|
|6/05/2017| d|UNSTABLE| 1|
|6/05/2017| e|UNSTABLE| 0|
|6/05/2017| f| UNKNOWN| 0|
+---------+-----------+--------+---------------+
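Note that row e comes out as 0 here rather than the 1 shown in the question's expected output: with the lag comparison above, the counter only increments when the previous status is identical to the current one, and e went from UNKNOWN on 5/05 to UNSTABLE on 6/05.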
Answer 1 (score: 0)
I ended up using Hive window partitioning functions with multiple loops.
The same can be achieved with Spark SQL.
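For reference, here is a minimal sketch of what that might look like in Spark SQL, assuming the data is registered as a temp table T1 with the Exchange_Id and Status columns used in the answer above (on Spark 1.x the LAG window function may require a HiveContext):

val solutionSql = sqlContext.sql("""
  SELECT Date, Exchange_Id, Status,
         CASE WHEN lower(Status) IN ('risky', 'unstable')
                   AND LAG(Status, 1) OVER (PARTITION BY Exchange_Id ORDER BY Date) = Status
              THEN 1 ELSE 0 END AS Consecutiveness
  FROM T1
  ORDER BY Date, Exchange_Id
""")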