I have 3 dataframes in Spark: dataframe1, dataframe2 and dataframe3.
I want to join dataframe1 with one of the other dataframes depending on a condition.
I use the following code:
Dataset<Row> df = dataframe1.filter(
    when(col("diffDate").lt(3888),
        dataframe1.join(dataframe2,
            dataframe2.col("id_device").equalTo(dataframe1.col("id_device"))
                .and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
                .and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))))
            .orderBy(dataframe2.col("tracking_time").desc()))
    .otherwise(
        dataframe1.join(dataframe3,
            dataframe3.col("id_device").equalTo(dataframe1.col("id_device"))
                .and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
                .and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time"))))
            .orderBy(dataframe3.col("tracking_time").desc())));
But I get this exception:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset
Edit
Input dataframes:
dataframe1
+----------+-----------+-------------+---------------+
| diffDate | id_device | id_vehicule | tracking_time |
+----------+-----------+-------------+---------------+
| 222      | 1         | 5           | 2020-05-30    |
| 4700     | 8         | 9           | 2019-03-01    |
+----------+-----------+-------------+---------------+
dataframe2
+-----------+-------------+---------------+-----------+
| id_device | id_vehicule | tracking_time | longitude |
+-----------+-------------+---------------+-----------+
| 1         | 5           | 2020-05-12    | 33.21111  |
| 8         | 9           | 2019-03-01    | 20.2222   |
+-----------+-------------+---------------+-----------+
dataframe3
+-----------+-------------+---------------+----------+
| id_device | id_vehicule | tracking_time | latitude |
+-----------+-------------+---------------+----------+
| 1         | 5           | 2020-05-12    | 40.333   |
| 8         | 9           | 2019-02-28    | 2.00000  |
+-----------+-------------+---------------+----------+
Expected result when diffDate < 3888:
+----------+-----------+-------------+---------------+-----------+-------------+---------------+-----------+
| diffDate | id_device | id_vehicule | tracking_time | id_device | id_vehicule | tracking_time | longitude |
+----------+-----------+-------------+---------------+-----------+-------------+---------------+-----------+
| 222      | 1         | 5           | 2020-05-30    | 1         | 5           | 2020-05-12    | 33.21111  |
+----------+-----------+-------------+---------------+-----------+-------------+---------------+-----------+
Expected result when diffDate > 3888:
+----------+-----------+-------------+---------------+-----------+-------------+---------------+----------+
| diffDate | id_device | id_vehicule | tracking_time | id_device | id_vehicule | tracking_time | latitude |
+----------+-----------+-------------+---------------+-----------+-------------+---------------+----------+
| 4700     | 8         | 9           | 2019-03-01    | 8         | 9           | 2019-02-28    | 2.00000  |
+----------+-----------+-------------+---------------+-----------+-------------+---------------+----------+
I need your help.
Thanks.
Answer 0 (score: 1)
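I think you need to revisit your code.
You are trying to perform a join for every row of dataframe1 (based on a condition, of course), which I think is either an incorrect requirement or a misunderstood one.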
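The when(condition, then).otherwise() function is evaluated for every row of the underlying dataframe and is generally used to compute a column's value based on a condition. The then and else/otherwise clauses of the function only support literals (primitive or complex data types) and existing columns of the dataframe. You cannot put a dataframe there, or any operation that returns a dataframe. That is why Spark reports "Unsupported literal type class org.apache.spark.sql.Dataset".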
Your requirement may be to join dataframe2 with only those rows of dataframe1 where col("diffDate").lt(3888) holds. To achieve this, you can do the following:
dataframe1.join(dataframe2,
        dataframe2.col("id_device").equalTo(dataframe1.col("id_device"))
            .and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
            .and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time")))
            .and(dataframe1.col("diffDate").lt(3888)))
    .orderBy(dataframe2.col("tracking_time").desc())
To cover both branches in a single result, apply the same pattern to dataframe3 with the complementary condition and combine the two joins with unionByName:
// Rows with diffDate < 3888 take longitude from dataframe2; latitude is null.
dataframe1.as("a").join(dataframe2.as("b"),
        dataframe2.col("id_device").equalTo(dataframe1.col("id_device"))
            .and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
            .and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time")))
            .and(dataframe1.col("diffDate").lt(3888)))
    .selectExpr("a.*", "b.longitude", "null as latitude")
    .unionByName(
        // Rows with diffDate >= 3888 take latitude from dataframe3; longitude is null.
        dataframe1.as("a").join(dataframe3.as("c"),
                dataframe3.col("id_device").equalTo(dataframe1.col("id_device"))
                    .and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
                    .and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time")))
                    .and(dataframe1.col("diffDate").geq(3888)))
            .selectExpr("a.*", "c.latitude", "null as longitude"))
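To make the row-level effect of the two filtered joins and the union concrete, here is a plain-Java sketch (no Spark involved) that models the same routing on the sample rows from the question. The class ConditionalJoinSketch, its nested types, and the conditionalJoin method are hypothetical names introduced only for this illustration. The sketch mimics the logic in memory under the assumption that the threshold is 3888 and that rows at or above it go to the latitude table; it is not how Spark executes the query.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ConditionalJoinSketch {

    // Hypothetical stand-in for one row of dataframe1.
    static final class Event {
        final int diffDate, idDevice, idVehicule;
        final LocalDate trackingTime;
        Event(int diffDate, int idDevice, int idVehicule, String trackingTime) {
            this.diffDate = diffDate;
            this.idDevice = idDevice;
            this.idVehicule = idVehicule;
            this.trackingTime = LocalDate.parse(trackingTime);
        }
    }

    // Hypothetical stand-in for one row of dataframe2 or dataframe3;
    // value holds the longitude or the latitude respectively.
    static final class Position {
        final int idDevice, idVehicule;
        final LocalDate trackingTime;
        final double value;
        Position(int idDevice, int idVehicule, String trackingTime, double value) {
            this.idDevice = idDevice;
            this.idVehicule = idVehicule;
            this.trackingTime = LocalDate.parse(trackingTime);
            this.value = value;
        }
    }

    // One joined output row, tagged with the column the value came from.
    static final class Joined {
        final Event event;
        final Position position;
        final String valueColumn;
        Joined(Event event, Position position, String valueColumn) {
            this.event = event;
            this.position = position;
            this.valueColumn = valueColumn;
        }
    }

    // Each event is matched against exactly one lookup table, chosen by diffDate.
    // Matches for a single event are listed newest first.
    static List<Joined> conditionalJoin(List<Event> events, List<Position> longitudes,
                                        List<Position> latitudes, int threshold) {
        Comparator<Position> newestFirst =
                Comparator.comparing((Position p) -> p.trackingTime).reversed();
        List<Joined> out = new ArrayList<>();
        for (Event e : events) {
            boolean useLongitude = e.diffDate < threshold;
            List<Position> lookup = useLongitude ? longitudes : latitudes;
            String column = useLongitude ? "longitude" : "latitude";
            lookup.stream()
                    .filter(p -> p.idDevice == e.idDevice
                            && p.idVehicule == e.idVehicule
                            && p.trackingTime.isBefore(e.trackingTime))
                    .sorted(newestFirst)
                    .forEach(p -> out.add(new Joined(e, p, column)));
        }
        return out;
    }

    // Sample rows copied from the tables in the question.
    static List<Event> sampleEvents() {
        return Arrays.asList(new Event(222, 1, 5, "2020-05-30"),
                             new Event(4700, 8, 9, "2019-03-01"));
    }

    static List<Position> sampleLongitudes() {
        return Arrays.asList(new Position(1, 5, "2020-05-12", 33.21111),
                             new Position(8, 9, "2019-03-01", 20.2222));
    }

    static List<Position> sampleLatitudes() {
        return Arrays.asList(new Position(1, 5, "2020-05-12", 40.333),
                             new Position(8, 9, "2019-02-28", 2.00000));
    }

    public static void main(String[] args) {
        for (Joined j : conditionalJoin(sampleEvents(), sampleLongitudes(),
                                        sampleLatitudes(), 3888)) {
            System.out.println(j.event.diffDate + " -> " + j.valueColumn
                    + " = " + j.position.value + " (" + j.position.trackingTime + ")");
        }
    }
}
```

The point of the sketch is that the choice between the two lookup tables is made per row of dataframe1, but it is expressed as two filtered dataset-level joins whose results are combined, rather than as a dataset placed inside when(...).otherwise(...).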