I have two dataframes: one with assorted data, df_datas, and another containing only business days, df_dates. I need to use a column from df_datas (dateOperation) to query the businessDay column of df_dates, as in a WHERE clause, and finally add the result as a new column on df_datas.
I am using Spark 2.1.
I first tried to do this with Hive SQL, but subqueries are not allowed there.
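For reference, what I tried was roughly the following (a hypothetical reconstruction, not my exact statement). Hive SQL does not accept a subquery in the SELECT list, and even this shape would only find the first later business day, not the second:

// Hypothetical reconstruction of the rejected SQL approach
// (table names registered here just for the example)
df_datas.createOrReplaceTempView("datas")
df_dates.createOrReplaceTempView("dates")
val dfSql = spark.sql("""
  SELECT d.*,
         (SELECT MIN(b.businessDay)
            FROM dates b
           WHERE b.businessDay > d.dateOperation) AS newDate
    FROM datas d
""")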
Then I tried writing a function that finds the second business day in df_dates after a given date, which I wanted to apply to df_datas:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, udf}
def twoBusinessDays2(x: String): String = {
  // rank the business days after x and return the second one
  val df_dates_ranked = df_dates.select("businessDay")
    .filter("businessDay > '" + x + "'")
    .withColumn("dense_rank", dense_rank().over(Window.orderBy("businessDay")))
    .filter("dense_rank = 2")
  df_dates_ranked.collect().map(_.mkString(" ")).head
}
val twoBusinessDays_udf2 = udf(twoBusinessDays2 _)
val dfnew = df_datas.withColumn("newDate", twoBusinessDays_udf2(df_datas.col("dateOperation")))
The function takes a String from the dateOperation column and returns the result as a String. But it does not work; it fails with this error:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
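A quick check: calling the function directly on the driver works; it only fails when invoked through the UDF:

twoBusinessDays2("2019-01-01")  // "2019-03-01" with the sample df_dates below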
Example:
df_datas
+------+------+----------------+
| Col1 | Col2 | dateOperation |
+------+------+----------------+
| A | 4 | 2019-01-01 |
| B | 2 | 2019-01-02 |
| C | 3 | 2019-03-01 |
| A | 1 | 2019-05-03 |
+------+------+----------------+
df_dates
+------+--------+--------------+
| Col1 | Col2 | businessDay |
+------+--------+--------------+
| 1 | MON | 2019-01-01 |
| 5 | FRI | 2019-01-02 |
| 4 | THU | 2019-03-01 |
| 2 | TUE | 2019-05-01 |
+------+--------+--------------+
Result:
+------+------+----------------+---------------+
| Col1 | Col2 | dateOperation | newDate |
+------+------+----------------+---------------+
| A | 4 | 2019-01-01 | 2019-01-03 |
| B | 2 | 2019-01-02 | 2019-01-04 |
| C | 3 | 2019-03-01 | 2019-03-05 |
| A | 1 | 2019-05-03 | 2019-05-07 |
+------+------+----------------+---------------+
newDate = dateOperation + 2 business days
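One direction I am considering, since df_dates is small: collect the business-day calendar to the driver once, broadcast it, and keep the DataFrame out of the UDF entirely. A minimal sketch, assuming df_dates fits on the driver and businessDay is a yyyy-MM-dd string (so string order equals date order):

import org.apache.spark.sql.functions.{col, udf}

// Collect the (small) business-day calendar to the driver, sorted ascending
val businessDays: Array[String] =
  df_dates.select("businessDay").orderBy("businessDay")
    .collect().map(_.getString(0))

// Broadcast it so executors can do the lookup without touching df_dates
val bDays = spark.sparkContext.broadcast(businessDays)

// Second business day strictly after x (null when fewer than two remain)
val twoBusinessDays = udf { x: String =>
  val later = bDays.value.filter(_ > x)
  if (later.length >= 2) later(1) else null
}

val dfnew2 = df_datas.withColumn("newDate", twoBusinessDays(col("dateOperation")))

Would this be a reasonable way to avoid the NullPointerException, or is there a more idiomatic join-based solution in Spark 2.1?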