I have a JSON file like the following:
{"ts": "01/03/2018 15:48:09+0530", "userid": "user1", "eventid":"EnterTripDetail" }
{"ts": "01/03/2018 15:48:09+0530", "userid": "user2", "eventid":"EnterTripDetail" }
{"ts": "01/03/2018 15:48:10+0530", "userid": "user1", "eventid":"ClickToPayTrip" }
{"ts": "01/03/2018 15:48:10+0530", "userid": "user2", "eventid":"ClickToPayTrip" }
{"ts": "01/03/2018 15:48:11+0530", "userid": "user1", "eventid":"SubmitPayment" }
Current code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read().json("/examples/transaction.json");
df.show();
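Now I want the list of users who have not completed payment within 10 minutes. In my case, that means users for whom the time between ClickToPayTrip and SubmitPayment is more than 10 minutes, or who have no SubmitPayment entry at all.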
Answer 0 (score: 1)
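The clicks and the payments can be split into separate DataFrames and then joined with a left join. Filtering the result keeps only the users who have no payment, or whose payment came more than 10 minutes after the click: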
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.to_timestamp;

// Split the events into clicks and payments
Dataset<Row> clickToPayTripDF = df.where(col("eventid").equalTo("ClickToPayTrip"));
Dataset<Row> submitPaymentDF = df.where(col("eventid").equalTo("SubmitPayment"));
// Left join on userid so that clicks without a matching payment are kept
Dataset<Row> joined = clickToPayTripDF.alias("click")
    .join(submitPaymentDF.alias("payment"), col("click.userid").equalTo(col("payment.userid")), "left");
// Convert both timestamps to epoch seconds; the trailing Z in the pattern parses the "+0530" offset
// Keep rows with no payment, or with a payment more than 600 seconds (10 minutes) after the click
Dataset<Row> result = joined
    .withColumn("clickSeconds", to_timestamp(col("click.ts"), "dd/MM/yyyy HH:mm:ssZ").cast("long"))
    .withColumn("paymentSeconds", to_timestamp(col("payment.ts"), "dd/MM/yyyy HH:mm:ssZ").cast("long"))
    .where(
        col("payment.eventid").isNull().or(
            expr("paymentSeconds - clickSeconds > 600")
        ))
    .select("click.userid", "click.ts", "click.eventid");
result.show(false);
Output:
+------+------------------------+--------------+
|userid|ts                      |eventid       |
+------+------------------------+--------------+
|user2 |01/03/2018 15:48:10+0530|ClickToPayTrip|
+------+------------------------+--------------+
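To check the filter rule without a Spark session, the same logic can be sketched in plain Java on the sample events. This is only an illustration, not part of the Spark solution: the class `UnpaidUsers` and the method `unpaidUsers` are hypothetical names, and the sketch assumes each user has at most one ClickToPayTrip and one SubmitPayment event (a later event of the same type overwrites an earlier one).

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class UnpaidUsers {
    // Matches timestamps such as "01/03/2018 15:48:09+0530"
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ssZ");

    // Each event is {ts, userid, eventid}. Returns, in sorted order, the users who
    // clicked to pay and either never paid or paid more than limitSeconds later.
    // Assumption: at most one click and one payment per user.
    static List<String> unpaidUsers(List<String[]> events, long limitSeconds) {
        Map<String, OffsetDateTime> clicks = new HashMap<>();
        Map<String, OffsetDateTime> payments = new HashMap<>();
        for (String[] e : events) {
            OffsetDateTime ts = OffsetDateTime.parse(e[0], TS);
            if (e[2].equals("ClickToPayTrip")) {
                clicks.put(e[1], ts);
            } else if (e[2].equals("SubmitPayment")) {
                payments.put(e[1], ts);
            }
        }
        List<String> result = new ArrayList<>();
        for (String user : new TreeSet<>(clicks.keySet())) {
            OffsetDateTime paid = payments.get(user);
            // Same condition as the Spark filter: no payment, or gap > limit
            if (paid == null
                    || Duration.between(clicks.get(user), paid).getSeconds() > limitSeconds) {
                result.add(user);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"01/03/2018 15:48:09+0530", "user1", "EnterTripDetail"},
            new String[] {"01/03/2018 15:48:09+0530", "user2", "EnterTripDetail"},
            new String[] {"01/03/2018 15:48:10+0530", "user1", "ClickToPayTrip"},
            new String[] {"01/03/2018 15:48:10+0530", "user2", "ClickToPayTrip"},
            new String[] {"01/03/2018 15:48:11+0530", "user1", "SubmitPayment"});
        System.out.println(unpaidUsers(events, 600)); // prints [user2]
    }
}
```

On the sample data, user1 pays one second after clicking and is dropped, while user2 has no SubmitPayment event and is kept, which matches the Spark output above.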