Question

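I need to select values from different columns (different versions of the same column, e.g. datecol, col1_v1, col1_v2, col1_v3, ...) based on start and end dates in datecol, since each version starts and ends on specific dates, and then merge them into a single column.

I already have the start and end dates as key-value pairs, and with the .between function I can get each column's values filtered by the date condition (inside a loop). But I need all of the results in a single column.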

df.withColumn("resultColumn", when(col("datecol").between(startdate, enddate), col("col1_v1")))

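The line above runs inside a loop that picks from a different column version according to each start and end date. The results must then be merged into a single column.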

datecol     col1_v1 col1_v2 col1_v3 result
01/01/2019  11      21      31      11
02/01/2019  12      22      32      22
03/01/2019  13      23      33      33

Answer 1

You can use a concatenation function such as concat_ws:

import org.apache.spark.sql.functions.{col, concat_ws}

// Copy each version column, then join the original non-date columns with commas.
dataframe.withColumn("resultColumn1", col("col_v1")).
  withColumn("resultColumn2", col("col_v2")).
  withColumn("resultColumn3", col("col_v3")).
  withColumn("result", concat_ws(",", dataframe.columns.filter(_ != "date_col").
  map(c => col(c)): _*)).show()

This gives:

+----------+------+------+------+-------------+-------------+-------------+--------+
|  date_col|col_v1|col_v2|col_v3|resultColumn1|resultColumn2|resultColumn3|  result|
+----------+------+------+------+-------------+-------------+-------------+--------+
|01/01/2019|    11|    21|    31|           11|           21|           31|11,21,31|
|02/01/2019|    12|    22|    32|           12|           22|           32|12,22,32|
|03/01/2019|    13|    23|    33|           13|           23|           33|13,23,33|
+----------+------+------+------+-------------+-------------+-------------+--------+
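Note that concat_ws joins every version into one string per row. If the goal is instead the result column shown in the question (one version picked per date range), one possible sketch is to build a when(...) expression per version and merge them with coalesce. The date ranges in `ranges`, the dd/MM/yyyy date format, and the names `ranges`, `picks` and `out` are assumptions made for illustration, not something stated in the question. The snippet is written to be pasted into spark-shell.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, to_date, when}

val spark = SparkSession.builder().master("local[*]").appName("merge-versions").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq(
  ("01/01/2019", 11, 21, 31),
  ("02/01/2019", 12, 22, 32),
  ("03/01/2019", 13, 23, 33)
).toDF("datecol", "col1_v1", "col1_v2", "col1_v3")

// Hypothetical, non-overlapping ranges: (column, start, end), both ends inclusive.
val ranges = Seq(
  ("col1_v1", "2019-01-01", "2019-01-01"),
  ("col1_v2", "2019-01-02", "2019-01-02"),
  ("col1_v3", "2019-01-03", "2019-01-03")
)

// Assumption: datecol is a dd/MM/yyyy string.
val d = to_date(col("datecol"), "dd/MM/yyyy")

// One expression per version: the column's value inside its range, null outside it.
val picks: Seq[Column] = ranges.map { case (name, start, end) =>
  when(d.between(to_date(lit(start)), to_date(lit(end))), col(name))
}

// coalesce keeps the first non-null pick, i.e. the version whose range matched.
val out = df.withColumn("result", coalesce(picks: _*))
out.show()
```

Rows whose date falls outside every range get null in result. If a selected version can itself contain nulls and ranges overlap, coalesce would fall through to the next matching version, so a chained when(...).when(...) may be preferable in that case.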
