Spark: using window partitionBy to look at the rows before the current row

Time: 2018-02-06 16:11:44

Tags: apache-spark apache-spark-sql spark-dataframe

I have the following DataFrame:

+-----+---------+-----------+-------------------+
|id   |lat      |lng        |timestamp          |
+-----+---------+-----------+-------------------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|
+-----+---------+-----------+-------------------+

Using a window with partitionBy on id and timestamp, I can find count (the number of occurrences), precount (the previous count), and pretsp (the previous timestamp).

The code and the resulting output DataFrame are below.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// All rows of one id, ordered by time.
val specDevicePartiton = Window.partitionBy("id").orderBy("timestamp")
// Only the rows that share the same id and timestamp.
val specDevicePartitonTimeStamp = Window.partitionBy("id", "timestamp").orderBy("timestamp")
// Note: specDevicePartitonPreTimeStamp, used for preFirstLat/preFirstLng below, is not defined in this snippet.
val userProfileDF = deviceDF.withColumn("prelatitude", lag(deviceDF("lat"), 1).over(specDevicePartiton))
    .withColumn("prelongitude", lag(deviceDF("lng"), 1).over(specDevicePartiton))
    .withColumn("pretimestamp", lag(deviceDF("timestamp"), 1).over(specDevicePartiton))
    // Rows with a duplicated timestamp take the previous timestamp of the first row in their group.
    .withColumn("pretsp", when(col("timestamp") === col("pretimestamp"), first(col("pretimestamp")).over(specDevicePartitonTimeStamp))
        .otherwise(col("pretimestamp")))
    // Number of rows sharing this id and timestamp.
    .withColumn("count", count("timestamp").over(specDevicePartitonTimeStamp))
    .withColumn("previousCount", lag(col("count"), 1).over(specDevicePartiton))
    .withColumn("precount", when(col("timestamp") === col("pretimestamp"), first(col("previousCount")).over(specDevicePartitonTimeStamp))
        .otherwise(col("previousCount")))
    // Attempt: first lat/lng of the previous timestamp group, with a hard-coded frame of two rows for lat.
    .withColumn("preFirstLat", when((col("precount") > 1) && (col("count") === 1), first(col("lat")).over(specDevicePartitonPreTimeStamp.rowsBetween(-2, -1))))
    .withColumn("preFirstLng", when((col("precount") > 1) && (col("count") === 1), first(col("lng")).over(specDevicePartitonPreTimeStamp)))
    .drop("prelatitude", "prelongitude", "nxtlatitude", "nxtlongitude", "pretimestamp")

I want to find the first and last values of the rows that precede the current row. The expected output would look like this:

|id   |lat      |lng        |timestamp          |pretsp             |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1    |1       |null       |null       |
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1    |1       |null       |null       |
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2    |1       |null       |null       |
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2    |1       |null       |null       |
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1    |2       |3.1357369  |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2    |1       |null       |null       |
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2    |1       |null       |null       |
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2    |2       |null       |null       |
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2    |2       |null       |null       |
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+

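Logic: find the first lat and lng values from the rows preceding the current row, where those preceding rows share the same timestamp but have different lat and lng values. Example: see the rows with timestamp 2017-11-07 04:16:30 and 2017-11-07 05:05:03 in the output above.

I have tried to use rowsBetween(start, end) dynamically, taking precount as start and -1 as end, but I don't know how to achieve this.

If I get a solution for finding the first value, I need to do the same calculation for the last value; I think the approach would be the same for the last value.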
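One possible way to express the logic above without a per-row frame size is sketched here. As far as I know, rowsBetween takes fixed Long bounds rather than a column such as precount, so this sketch swaps in a different technique: it first collapses each (id, timestamp) group to a single row, applies lag(…, 1) across groups, and joins the result back. The names groupDF, prevDF, byId, firstLat, lastLat, firstLng, lastLng, preLastLat and preLastLng are made up for illustration, and the sample rows are copied from the input table. Rows that share a timestamp have no inherent order, so which one counts as "first" or "last" is an assumption unless a tie-breaker column exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("prev-group-first-last").getOrCreate()
import spark.implicits._

// Sample rows copied from the question's input table.
// Timestamps are kept as strings; this format sorts correctly as text.
val deviceDF = Seq(
  ("user1", 3.1357369, 101.6863713, "2017-11-06 19:33:16"),
  ("user1", 3.1360323, 101.6874385, "2017-11-06 21:10:25"),
  ("user1", 3.1363076, 101.6902847, "2017-11-07 01:39:07"),
  ("user1", 3.1357369, 101.6863713, "2017-11-07 01:39:07"),
  ("user1", 3.1357369, 101.6863713, "2017-11-07 04:16:30"),
  ("user1", 3.1357409, 101.6860155, "2017-11-07 05:05:03"),
  ("user1", 3.1357369, 101.6863713, "2017-11-07 05:05:03"),
  ("user1", 3.1357369, 101.6863713, "2017-11-07 06:13:07"),
  ("user1", 3.1360323, 101.6874385, "2017-11-07 06:13:07")
).toDF("id", "lat", "lng", "timestamp")

// One row per (id, timestamp) group: its size and its first/last coordinates.
// first/last follow whatever row order Spark sees inside the group, which is
// not guaranteed; treat the chosen row as an assumption.
val groupDF = deviceDF.groupBy("id", "timestamp").agg(
  count(lit(1)).as("count"),
  first("lat").as("firstLat"),
  first("lng").as("firstLng"),
  last("lat").as("lastLat"),
  last("lng").as("lastLng")
)

// Groups of one id in time order; lag(…, 1) now means "previous timestamp group".
val byId = Window.partitionBy("id").orderBy("timestamp")

val prevDF = groupDF
  .withColumn("pretsp", lag("timestamp", 1).over(byId))
  .withColumn("precount", lag("count", 1).over(byId))
  .withColumn("preFirstLat", lag("firstLat", 1).over(byId))
  .withColumn("preFirstLng", lag("firstLng", 1).over(byId))
  .withColumn("preLastLat", lag("lastLat", 1).over(byId))
  .withColumn("preLastLng", lag("lastLng", 1).over(byId))
  .drop("firstLat", "firstLng", "lastLat", "lastLng")

// Attach the per-group values to every original row.
val result = deviceDF.join(prevDF, Seq("id", "timestamp")).orderBy("id", "timestamp")
result.show(false)
```

Because the previous group is reached with a fixed offset of one row, the frame no longer depends on precount. The earliest group of each id gets null in every pre* column, since the sample contains no earlier row.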

Here is a simple example:

|id   |lat      |lng        |timestamp          |pretsp             |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1    |null    |null       |null       |
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1    |1       |3.1357369  |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2    |1       |3.1360323  |101.6874385|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2    |1       |3.1360323  |101.6874385|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1    |2       |3.1357369  |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2    |1       |3.1357369  |101.6863713|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2    |1       |3.1357369  |101.6863713|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2    |2       |3.1357369  |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2    |2       |3.1357369  |101.6863713|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+

Here, in the highlighted row for 2017-11-08, prefristVal 20 is the first entry of 2017-11-07, and preLastVal 25 is the last entry of 2017-11-07.

Thanks,

0 answers:

No answers