Window function with partitionBy on a Spark DataFrame does not work

Time: 2018-05-12 04:51:01

Tags: scala apache-spark apache-spark-sql

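This is my DataFrame 1. I am keeping only the latest record, by TimeStamp, for each combination of ("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "UpdateReason_updateReasonId").

Here I am grouping on these six columns.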

uniqueFundamentalSet    PeriodId    SourceId    StatementTypeCode   StatementCurrencyId UpdateReason_updateReasonId UpdateReasonComment UpdateReasonComment_languageId  UpdateReasonEnumerationId   FFAction|!| DataPartition   PartitionYear   TimeStamp
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:11:15+00:00
192730230775    297 181 INC 500186  1   UpdateReason2UpdateIsNowUPdated 505074  3019680 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    297 182 INC 500186  6   UpdateReasonToDelete    505074  3019685 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 182 INC 500186  6   UpdateReasonToDelete    505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00

The code for this is below.
val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "UpdateReason_updateReasonId").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
val latestForEachKey1 = tempReorder.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1).drop("rank")

This gives me the following output.

uniqueFundamentalSet    PeriodId    SourceId    StatementTypeCode   StatementCurrencyId UpdateReason_updateReasonId UpdateReasonComment UpdateReasonComment_languageId  UpdateReasonEnumerationId   FFAction|!| DataPartition   PartitionYear   TimeStamp
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:11:15+00:00
192730230775    297 182 INC 500186  6   UpdateReasonToDelete    505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 181 INC 500186  1   UpdateReason2UpdateIsNowUPdated 505074  3019680 I|!|    Japan   2017    2018-05-10T10:08:01+00:00

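Next, I want to filter based on ("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId") when FFAction|!| is "O|!|" or "D|!|".

Then I want to combine the latest records from the first DataFrame and the second DataFrame for the final output.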

That way I get the latest I|!| based on ("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "UpdateReason_updateReasonId") and the latest O|!| based on ("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId").

In this case, my final output would be

uniqueFundamentalSet    PeriodId    SourceId    StatementTypeCode   StatementCurrencyId UpdateReason_updateReasonId UpdateReasonComment UpdateReasonComment_languageId  UpdateReasonEnumerationId   FFAction|!| DataPartition   PartitionYear
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017
192730230775    297 181 INC 500186  1   UpdateReason2UpdateIsNowUPdated 505074  3019680 I|!|    Japan   2017

This is the final code I am trying.

import org.apache.spark.sql.expressions._
    val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "UpdateReason_updateReasonId").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
    val latestForEachKey1 = tempReorder.withColumn("rank", row_number().over(windowSpec))
      .filter($"rank" === 1).drop("rank")

    val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
    val latestForEachKey2 = latestForEachKey1.withColumn("tobefiltered", row_number().over(windowSpec2))
      .filter(($"FFAction|!|" === "I|!|" || $"FFAction|!|" === "O|!|" || ($"FFAction|!|" === "D|!|" && $"FFAction|!|" === "D|!|")) && $"tobefiltered" === 1)
      .drop("tobefiltered", "TimeStamp")

But when I apply the code above, the last record is missing:

192730230775    297 181 INC 500186  1   UpdateReason2UpdateIsNowUPdated 505074  3019680 I|!|    Japan   2017 

1 Answer:

Answer 0 (score: 3)

You need to redefine the logic you are using. What you need is a group defined on the five columns uniqueFundamentalSet, PeriodId, SourceId, StatementTypeCode, StatementCurrencyId: if O|!| is present in the FFAction|!| column for that combination, all of its rows form a single group; otherwise the rows are also split by UpdateReason_updateReasonId. After defining the group, you can filter with the row-number logic as usual.

The solution, commented for clarity:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

//window for checking if O|!| is present in the group
val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId")

//window for filtering out the latest after applying the group defined in previous window
val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "group").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)

//udf to check if the group has O|!| or not
def containsUdf = udf{(array: Seq[String])=> array.contains("O|!|")}

//applying the window and udf functions and filtering in the latest
val latestForEachKey1 = tempReorder.withColumn("group", when(containsUdf(collect_list("FFAction|!|").over(windowSpec)), lit("same")).otherwise($"UpdateReason_updateReasonId"))
  .withColumn("rank", row_number().over(windowSpec2))
  .filter($"rank" === 1).drop("rank", "group")

This should give you:

+--------------------+--------+--------+-----------------+-------------------+---------------------------+-------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|uniqueFundamentalSet|PeriodId|SourceId|StatementTypeCode|StatementCurrencyId|UpdateReason_updateReasonId|UpdateReasonComment            |UpdateReasonComment_languageId|UpdateReasonEnumerationId|FFAction|!||DataPartition|PartitionYear|TimeStamp                |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+-------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|192730230775        |297     |181     |INC              |500186             |1                          |UpdateReason2UpdateIsNowUPdated|505074                        |3019680                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00|
|192730230775        |297     |181     |INC              |500186             |4                          |New Reason Added               |505074                        |3019683                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00|
|192730230775        |297     |182     |INC              |500186             |null                       |null                           |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:11:15+00:00|
|192730230775        |297     |180     |INC              |500186             |6                          |InsertUpdateReason             |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:00:40+00:00|
+--------------------+--------+--------+-----------------+-------------------+---------------------------+-------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
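
To check the grouping rule without a Spark session, it can be modelled with plain Scala collections. This is only a sketch, not the Spark code itself: `LatestPerGroup`, `Row`, `latest` and `sample` are hypothetical names, and `sourceId` stands in for all five key columns, on the assumption (true for the sample data) that the other four are constant.

```scala
import java.time.OffsetDateTime

object LatestPerGroup {
  // Simplified row: only the columns that drive the grouping logic.
  final case class Row(
      sourceId: Int,             // stands in for the five shared key columns (assumption)
      reasonId: Option[Int],     // UpdateReason_updateReasonId, null mapped to None
      action: String,            // FFAction|!|
      timestamp: OffsetDateTime) // TimeStamp

  // Keep the newest row per group. If any row for a key has action "O|!|",
  // all rows for that key share one group (None); otherwise the rows are
  // additionally split by reasonId (Some(reasonId)).
  def latest(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(_.sourceId).values.toSeq.flatMap { sameKey =>
      val hasO = sameKey.exists(_.action == "O|!|")
      sameKey
        .groupBy(r => if (hasO) None else Some(r.reasonId))
        .values
        .map(_.maxBy(_.timestamp.toEpochSecond))
    }

  private def row(src: Int, reason: Option[Int], action: String, ts: String): Row =
    Row(src, reason, action, OffsetDateTime.parse(ts))

  // The nine input rows from the question.
  val sample: Seq[Row] = Seq(
    row(182, None,    "O|!|", "2018-05-10T10:11:15+00:00"),
    row(181, Some(1), "I|!|", "2018-05-10T10:08:01+00:00"),
    row(181, Some(4), "I|!|", "2018-05-10T10:08:01+00:00"),
    row(180, Some(6), "I|!|", "2018-05-10T09:57:29+00:00"),
    row(181, Some(1), "I|!|", "2018-05-10T09:57:29+00:00"),
    row(182, Some(6), "I|!|", "2018-05-10T09:57:29+00:00"),
    row(180, Some(6), "I|!|", "2018-05-10T10:00:40+00:00"),
    row(181, Some(1), "I|!|", "2018-05-10T10:00:40+00:00"),
    row(182, Some(6), "I|!|", "2018-05-10T10:00:40+00:00")
  )
}
```

Calling `LatestPerGroup.latest(LatestPerGroup.sample)` keeps both SourceId 181 rows (reason ids 1 and 4) because that key has no O|!| row, while SourceId 182 collapses to its single O|!| row. Row order in the result is not guaranteed.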
