Conditionally and dynamically creating a new column in Spark (Scala)

Asked: 2017-11-08 17:11:08

Tags: scala apache-spark

I'm trying to create a new conditional column in Spark that is populated from a programmatically selected existing column, where the selection is based on the processed output of a third column.

Apologies if this sounds convoluted, but here's an example. Sample df:

// sample df
val df = Seq(
  (1, "2014/07/31 23:00:01", 1, 2), 
  (1, "2014/07/30 12:40:32", 3, 3), 
  (1, "2016/08/09 10:12:43", 5, 6))
.toDF("id", "date", "7_col", "8_col")
.withColumn("timestamp", unix_timestamp($"date", "yyyy/MM/dd HH:mm:ss").cast("timestamp"))

+---+-------------------+-----+-----+-------------------+
| id|               date|7_col|8_col|          timestamp|
+---+-------------------+-----+-----+-------------------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|
+---+-------------------+-----+-----+-------------------+

Now I'd like to create a new column that contains the contents of either 7_col or 8_col, depending on whether the month in the timestamp column is month 7 (7_col) or month 8 (8_col). The result should look like this:

+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+

Now, I can partially do this programmatically if I simply pass the month as an Int and interpolate it into the column name, like so:

df.withColumn("new_col", $"${7}_col" ).show 

+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      5|
+---+-------------------+-----+-----+-------------------+-------+

However, when I try to pass the month extracted from the timestamp column instead of a literal number, it doesn't work:

df.withColumn("new_col", $"${month($"timestamp")}_col").show 

Error: org.apache.spark.sql.AnalysisException: cannot resolve '`month(timestamp)_col`' given input columns: [7_col, id, date, 8_col, timestamp];

Now, I know the month-extraction code works and produces an integer result; for example, I can simply populate new_col with the extracted month:

df.withColumn("new_col", month($"timestamp")).show 

+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      7|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      7|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      8|
+---+-------------------+-----+-----+-------------------+-------+

But I can't figure out why I can't pass an Int like this and splice it into the column name.

Any ideas?

1 Answer:

Answer 0 (score: 2)

You can use when / otherwise:

df.withColumn("new_col", when(month($"timestamp") === 7, $"7_col").otherwise($"8_col")).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+
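
Why the original attempt fails: month($"timestamp") is an unevaluated Column expression, not a per-row value. The string interpolation runs on the driver and simply embeds the column's textual representation into the name, which is where the non-existent month(timestamp)_col in the error comes from. A minimal sketch, assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope:

// month($"timestamp") is a Column object; interpolating it into a string
// yields its textual form, not a value computed from each row
val c = month($"timestamp")
println(s"${c}_col")  // prints roughly "month(timestamp)_col", a column that doesn't exist

when / otherwise, by contrast, is evaluated per row inside Spark, so the choice of source column can depend on the data in each row.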

Another option for handling the month columns dynamically:

val months = (7 to 8).map(m => when(month(col("timestamp")) === m, col(s"${m}_col")))
//       change (7 to 8) to the range covering all existing month columns

df.withColumn("new_col", coalesce(months: _*)).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+
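
If the set of month columns isn't known up front, here is a sketch of building the branches only from columns that actually exist; the 1-to-12 range and the "<m>_col" naming pattern are assumptions based on the question:

// build one when(...) branch per month column present in the DataFrame;
// coalesce returns the first non-null branch, or null if no month matches
val monthBranches = (1 to 12)
  .filter(m => df.columns.contains(s"${m}_col"))
  .map(m => when(month(col("timestamp")) === m, col(s"${m}_col")))

df.withColumn("new_col", coalesce(monthBranches: _*)).show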