Spark: Complex operation with DataFrames

Time: 2019-02-18 12:59:39

Tags: java apache-spark apache-spark-sql

I have an input dataset in the following format:

+---+--------+----------+
| id|   refId| timestamp|
+---+--------+----------+
|  1|    null|1548944642|
|  1|29950529|1548937685|
|  2|27510720|1548944885|
|  2|27510720|1548943617|
+---+--------+----------+

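I need to add a new column, session, using the following transformation logic:

  1. If refId is null, the session value is true.
  2. If id and refId are unique, the session value is true.
  3. If id and refId are not unique and the timestamp is greater than the previous row's, the session value is true. The difference between the timestamps should also be > 60.

Expected output:
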
+---+--------+-------+----------+
| id|   refId|session| timestamp|
+---+--------+-------+----------+
|  1|    null|   true|1548944642|
|  1|29950529|   true|1548937685|
|  2|27510720|  false|1548943617|
|  2|27510720|   true|1548944885|
+---+--------+-------+----------+
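
To pin down the three rules before turning to Spark, here is a minimal plain-Java sketch of the same logic. It is not Spark code: the class `SessionRules`, the `Row` type, and the method `flagSessions` are hypothetical names introduced only for illustration. It groups rows by (id, refId), the same key a window partition would use, sorts each group by timestamp, and compares each row with the previous one. It assumes, based on the expected output above, that the earliest row of a non-unique group gets false because it has no previous row to compare against.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionRules {

    /** One input row; refId is null when the source value is null. */
    public static final class Row {
        public final long id;
        public final Long refId;
        public final long timestamp;
        public boolean session;

        public Row(long id, Long refId, long timestamp) {
            this.id = id;
            this.refId = refId;
            this.timestamp = timestamp;
        }
    }

    /** Sets the session flag on every row according to the three rules. */
    public static void flagSessions(List<Row> rows) {
        // Group rows by (id, refId), mirroring a window partition on those columns.
        Map<String, List<Row>> groups = new HashMap<>();
        for (Row r : rows) {
            groups.computeIfAbsent(r.id + "|" + r.refId, k -> new ArrayList<>()).add(r);
        }
        for (List<Row> group : groups.values()) {
            // Order each group by timestamp so index i - 1 is the "previous row".
            group.sort(Comparator.comparingLong((Row r) -> r.timestamp));
            for (int i = 0; i < group.size(); i++) {
                Row r = group.get(i);
                if (r.refId == null) {
                    r.session = true;                  // rule 1: refId is null
                } else if (group.size() == 1) {
                    r.session = true;                  // rule 2: (id, refId) is unique
                } else {
                    // rule 3: more than 60 after the previous row in the group.
                    // Assumption: the first row of the group has no previous row
                    // and is therefore false.
                    r.session = i > 0
                            && r.timestamp - group.get(i - 1).timestamp > 60;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, null, 1548944642L));
        rows.add(new Row(1, 29950529L, 1548937685L));
        rows.add(new Row(2, 27510720L, 1548944885L));
        rows.add(new Row(2, 27510720L, 1548943617L));
        flagSessions(rows);
        for (Row r : rows) {
            System.out.println(r.id + " " + r.refId + " " + r.session + " " + r.timestamp);
        }
    }
}
```

Run on the four sample rows, this should reproduce the session flags in the expected-output table. In Spark, the group size and the previous timestamp would come from window functions over the same (id, refId) partition rather than from in-memory lists.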

I am able to handle conditions 1 and 3 separately, but not the second one.

  1. Condition 1: data.withColumn("session", functions.when(data.col("refId").isNull(), true));
  2. Condition 3:
WindowSpec w = Window.partitionBy("id", "refId").orderBy(timestampDS.col("timestamp"));
functions.coalesce(timestampDS.col("timestamp").cast("long").$minus(functions.lag("timestamp", 1).over(w).cast("long")), functions.lit(0));

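My question is how to satisfy the second condition and implement all three transformations together.

2 answers:

Answer 0 (score: 1)

I would say that using Spark SQL reduces the complexity and makes it easy to get the result.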

df.createOrReplaceTempView("test")

spark.sql("select id,refId,timestamp,case when refId is null and id is not null then 'true' when id is not null and refId is not null and rank=1 then 'true' else 'false' end as session from  (select id,refId,timestamp, rank() OVER (PARTITION BY id,refId ORDER BY timestamp DESC) as rank from test) c").show()

The output looks like this:

+---+--------+----------+-------+
| id|   refId| timestamp|session|
+---+--------+----------+-------+
|  1|    null|1548944642|   true|
|  1|29950529|1548937685|   true|
|  2|27510720|1548944885|   true|
|  2|27510720|1548943617|  false|
+---+--------+----------+-------+ 

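Answer 1 (score: 1)

You can use a window function to partition by id and refId and order by timestamp, then add a rank column. Finally, add the session column using when or a SQL function.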
