Find rows with a large time difference and copy those rows to a new column

Date: 2018-10-25 05:37:34

Tags: java apache-spark apache-spark-sql

I have a JSON file with the following contents (one JSON object per line):

{"ts": "01/03/2018 15:48:09+0530", "userid": "user1", "eventid":"EnterTripDetail" }
{"ts": "01/03/2018 15:48:09+0530", "userid": "user2", "eventid":"EnterTripDetail" }
{"ts": "01/03/2018 15:48:10+0530", "userid": "user1", "eventid":"ClickToPayTrip" }
{"ts": "01/03/2018 15:48:10+0530", "userid": "user2", "eventid":"ClickToPayTrip" }
{"ts": "01/03/2018 15:48:11+0530", "userid": "user1", "eventid":"SubmitPayment" }

Current code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("trips").getOrCreate();
// each line of the file is a separate JSON object, which spark.read().json() handles natively
Dataset<Row> df = spark.read().json("/examples/transaction.json");
df.show();
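
As a side note (not part of the original question), spark.read().json() infers ts as a plain string rather than a timestamp, which is why it has to be parsed explicitly in the answer below. A quick way to confirm the inferred schema:

// the inferred schema should look roughly like this; ts is a string,
// so any time arithmetic requires an explicit to_timestamp/unix_timestamp conversion
df.printSchema();
// root
//  |-- eventid: string (nullable = true)
//  |-- ts: string (nullable = true)
//  |-- userid: string (nullable = true)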

Now I want the list of users who have not completed payment within 10 minutes. In my case, that means users for whom the time between ClickToPayTrip and SubmitPayment exceeds 10 minutes, or users who have no SubmitPayment entry at all.

1 Answer:

Answer 0 (score: 1):

The clicks and the payments can be split into separate DataFrames, joined with a left join, and then filtered to keep only the users who either paid more than 10 minutes after clicking or never paid at all:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.to_timestamp;

// split clicks and payments into separate DataFrames
Dataset<Row> clickToPayTripDF = df.where(col("eventid").equalTo("ClickToPayTrip"));
Dataset<Row> submitPaymentDF = df.where(col("eventid").equalTo("SubmitPayment"));

// left-join payments onto clicks by userid, so clicks with no matching payment survive with nulls
Dataset<Row> joined = clickToPayTripDF.alias("click")
    .join(submitPaymentDF.alias("payment"), clickToPayTripDF.col("userid").equalTo(submitPaymentDF.col("userid")), "left");

// parse both timestamps to epoch seconds (the trailing Z in the pattern consumes the +0530 offset),
// then keep users with no payment at all or a payment more than 600 seconds after the click
Dataset<Row> result = joined
    .withColumn("clickSeconds", to_timestamp(col("click.ts"), "dd/MM/yyyy HH:mm:ssZ").cast("long"))
    .withColumn("paymentSeconds", to_timestamp(col("payment.ts"), "dd/MM/yyyy HH:mm:ssZ").cast("long"))
    .where(
        col("payment.eventid").isNull().or(
            expr("paymentSeconds - clickSeconds > 600")
        ))
    .drop("clickSeconds", "paymentSeconds")
    .select("click.userid", "click.ts", "click.eventid");

result.show(false);

Output:

+------+------------------------+--------------+
|userid|ts                      |eventid       |
+------+------------------------+--------------+
|user2 |01/03/2018 15:48:10+0530|ClickToPayTrip|
+------+------------------------+--------------+
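
For reference, the same left join and filter can also be expressed in plain Spark SQL via a temporary view. This is a minimal equivalent sketch, not part of the original answer; the view name "events" and the variable "unpaid" are illustrative:

// register the raw events as a view and express the join/filter in SQL;
// unix_timestamp with the same pattern converts the ts strings to epoch seconds
df.createOrReplaceTempView("events");

Dataset<Row> unpaid = spark.sql(
    "SELECT c.userid, c.ts, c.eventid " +
    "FROM (SELECT * FROM events WHERE eventid = 'ClickToPayTrip') c " +
    "LEFT JOIN (SELECT * FROM events WHERE eventid = 'SubmitPayment') p " +
    "  ON c.userid = p.userid " +
    "WHERE p.eventid IS NULL " +
    "   OR unix_timestamp(p.ts, 'dd/MM/yyyy HH:mm:ssZ') " +
    "    - unix_timestamp(c.ts, 'dd/MM/yyyy HH:mm:ssZ') > 600");
unpaid.show(false);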