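I am trying to create a new conditional column in Spark. It should be filled from an existing column that is selected programmatically, based on the processed output of a third column.
Apologies if that sounds complicated; here is an example. Sample df: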
// sample df (outside spark-shell, first import spark.implicits._ and org.apache.spark.sql.functions._)
val df = Seq(
  (1, "2014/07/31 23:00:01", 1, 2),
  (1, "2014/07/30 12:40:32", 3, 3),
  (1, "2016/08/09 10:12:43", 5, 6))
  .toDF("id", "date", "7_col", "8_col")
  .withColumn("timestamp", unix_timestamp($"date", "yyyy/MM/dd HH:mm:ss").cast("timestamp"))
+---+-------------------+-----+-----+-------------------+
| id|               date|7_col|8_col|          timestamp|
+---+-------------------+-----+-----+-------------------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|
+---+-------------------+-----+-----+-------------------+
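Now I want to create a new column that contains the contents of either 7_col or 8_col, depending on whether the month in the timestamp column is month 7 (7_col) or month 8 (8_col). So the result should look like this: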
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+
Now, if I simply pass the month as an Int and interpolate it into the column name, I can do part of this programmatically, like so:
df.withColumn("new_col", $"${7}_col" ).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      5|
+---+-------------------+-----+-----+-------------------+-------+
However, when I try to pass the month extracted from the timestamp column instead of a literal number, it does not work:
df.withColumn("new_col", $"${month($"timestamp")}_col").show
Error: org.apache.spark.sql.AnalysisException: cannot resolve '`month(timestamp)_col`' given input columns: [7_col, id, date, 8_col, timestamp];
Now, I know that the code extracting the month works and produces an Int result; for example, I can simply fill new_col with the extracted month Int:
df.withColumn("new_col", month($"timestamp")).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      7|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      7|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      8|
+---+-------------------+-----+-----+-------------------+-------+
But I cannot figure out why I am unable to pass such an Int and interpolate it into the column name.
Any ideas?
Answer 0 (score: 2)
You can use when.otherwise:
df.withColumn("new_col", when(month($"timestamp") === 7, $"7_col").otherwise($"8_col")).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+
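As for why the interpolated version in the question fails: month($"timestamp") is a Column, that is, an unevaluated expression, not an Int. String interpolation runs once while the query is being built and simply calls toString on that Column, so Spark ends up looking for a single column with one fixed name for every row. The following is only a minimal sketch of that behaviour; the object name InterpolationDemo and the helper interpolatedName are made up for illustration.

```scala
import org.apache.spark.sql.functions.{col, month}

// Hypothetical demo object; not part of the question or of Spark itself.
object InterpolationDemo {

  // Builds the same string that $"${month($"timestamp")}_col" would use as a column name.
  def interpolatedName(): String = {
    // month(...) returns a Column (an expression tree), not a per-row Int.
    val monthExpr = month(col("timestamp"))
    // Interpolation calls monthExpr.toString once, on the driver, before any row is read.
    s"${monthExpr}_col"
  }

  def main(args: Array[String]): Unit = {
    // Prints the expression's text followed by "_col", never "7_col" or "8_col".
    println(interpolatedName())
  }
}
```

On the Spark version used in the question, the printed name should correspond to the `month(timestamp)_col` shown in the AnalysisException. A column name has to be known when the plan is built, which is why the per-row choice is expressed with when instead.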
Another option to handle month_col dynamically:
val months = (7 to 8).map(m => when(month(col("timestamp")) === m, col(s"${m}_col")))
// change 7 to 8 to a sequence of all existing month columns
df.withColumn("new_col", coalesce(months: _*)).show
+---+-------------------+-----+-----+-------------------+-------+
| id|               date|7_col|8_col|          timestamp|new_col|
+---+-------------------+-----+-----+-------------------+-------+
|  1|2014/07/31 23:00:01|    1|    2|2014-07-31 23:00:01|      1|
|  1|2014/07/30 12:40:32|    3|    3|2014-07-30 12:40:32|      3|
|  1|2016/08/09 10:12:43|    5|    6|2016-08-09 10:12:43|      6|
+---+-------------------+-----+-----+-------------------+-------+
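If the set of month columns is not known up front, one possible way to avoid hardcoding the 7 to 8 range is to derive the branches from df.columns. This is a sketch only, assuming every month column is named "<month number>_col" as in the sample data; the object MonthColumnPicker and the helper monthBranches are hypothetical names, not Spark API.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, month, unix_timestamp, when}

// Hypothetical helper object for illustration.
object MonthColumnPicker {

  // Assumed naming convention: "<month number>_col", e.g. "7_col".
  private val MonthCol = """(\d{1,2})_col""".r

  // One when(...) branch per matching column; a branch is null for every other month.
  def monthBranches(df: DataFrame, tsCol: String): Seq[Column] =
    df.columns.toSeq.collect {
      case name @ MonthCol(m) => when(month(col(tsCol)) === m.toInt, col(name))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("month-col").getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq(
      (1, "2014/07/31 23:00:01", 1, 2),
      (1, "2014/07/30 12:40:32", 3, 3),
      (1, "2016/08/09 10:12:43", 5, 6))
      .toDF("id", "date", "7_col", "8_col")
      .withColumn("timestamp", unix_timestamp($"date", "yyyy/MM/dd HH:mm:ss").cast("timestamp"))

    // coalesce picks the first non-null branch, i.e. the column matching the row's month.
    df.withColumn("new_col", coalesce(monthBranches(df, "timestamp"): _*)).show()

    spark.stop()
  }
}
```

Note that coalesce needs at least one argument, so this sketch assumes the DataFrame has at least one matching column; rows whose month has no corresponding column get null in new_col.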