我不知道如何将这两个DataFrame相互连接。
第一个DataFrame存储有关用户到服务中心的请求时间的信息。
我们将此数据框称为df1
:
+-----------+---------------------+
| USER_NAME | REQUEST_DATE |
+-----------+---------------------+
| Alex | 2018-03-01 00:00:00 |
| Alex | 2018-09-01 00:00:00 |
| Bob | 2018-03-01 00:00:00 |
| Mark | 2018-02-01 00:00:00 |
| Mark | 2018-07-01 00:00:00 |
| Kate | 2018-02-01 00:00:00 |
+-----------+---------------------+
第二个DataFrame存储有关用户可以使用服务中心的服务的可能期限(许可期限)的信息。
我们称它为df2
。
+-----------+---------------------+---------------------+------------+
| USER_NAME | START_SERVICE | END_SERVICE | STATUS |
+-----------+---------------------+---------------------+------------+
| Alex | 2018-01-01 00:00:00 | 2018-06-01 00:00:00 | Active |
| Bob | 2018-01-01 00:00:00 | 2018-02-01 00:00:00 | Not Active |
| Mark | 2018-01-01 00:00:00 | 2018-05-01 23:59:59 | Active |
| Mark | 2018-05-01 00:00:00 | 2018-08-01 23:59:59 | VIP |
+-----------+---------------------+---------------------+------------+
如何加入这两个DataFrame并返回这样的结果?如何在处理时获取用户许可证类型列表?
+-----------+---------------------+----------------+
| USER_NAME | REQUEST_DATE | STATUS |
+-----------+---------------------+----------------+
| Alex | 2018-03-01 00:00:00 | Active |
| Alex | 2018-09-01 00:00:00 | No information |
| Bob | 2018-03-01 00:00:00 | Not Active |
| Mark | 2018-02-01 00:00:00 | Active |
| Mark | 2018-07-01 00:00:00 | VIP |
| Kate | 2018-02-01 00:00:00 | No information |
+-----------+---------------------+----------------+
代码:
import org.apache.spark.sql.DataFrame
val df1: DataFrame = Seq(
("Alex", "2018-03-01 00:00:00"),
("Alex", "2018-09-01 00:00:00"),
("Bob", "2018-03-01 00:00:00"),
("Mark", "2018-02-01 00:00:00"),
("Mark", "2018-07-01 00:00:00"),
("Kate", "2018-07-01 00:00:00")
).toDF("USER_NAME", "REQUEST_DATE")
df1.show()
val df2: DataFrame = Seq(
("Alex", "2018-01-01 00:00:00", "2018-06-01 00:00:00", "Active"),
("Bob", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "Not Active"),
("Mark", "2018-01-01 00:00:00", "2018-05-01 23:59:59", "Active"),
("Mark", "2018-05-01 00:00:00", "2018-08-01 23:59:59", "Active")
).toDF("USER_NAME", "START_SERVICE", "END_SERVICE", "STATUS")
df1.show()
val total = df1.join(df2, df1("USER_NAME")===df2("USER_NAME"), "left").filter(df1("REQUEST_DATE") >= df2("START_SERVICE") && df1("REQUEST_DATE") <= df2("END_SERVICE")).select(df1("*"), df2("STATUS"))
total.show()
错误:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:232)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:85)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:206)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:354)
at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:383)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:354)
at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:125)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:85)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:45)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:97)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
答案 0 :(得分:2)
如何加入这两个DataFrame并返回这样的结果?
df_joined = df1.join(df2, Seq('USER_NAME'), 'left' )
如何获取许可证仍然相关的所有用户的列表?
df_relevant = df_joined
.withColumn('STATUS', when(col('REQUEST_DATE') > col('START_SERVICE') and col('REQUEST_DATE') < col('END_SERVICE'), col('STATUS')).otherwise('No information')
.select('USER_NAME', 'REQUEST_DATE', 'STATUS' )
答案 1 :(得分:0)
您正在比较<=和> =在不正确的字符串值上。在进行此类比较之前,应将它们转换为时间戳。以下代码对我有用。
顺便说一句,您使用的过滤条件没有给出您在问题中发布的结果。请再次检查。
destination_arn = "${aws_cloudwatch_log_group.lg.arn}"
format = ""
}