What is the most efficient way to merge two columns in a Spark DataFrame?
I have two columns that mean the same thing. Null values in timestamp
should be filled from toAppendData_timestamp.
When both columns have a value, the values are equal...
I have this:
+--------------------+----------------------+--------+
| timestamp|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
| null| 2016-03-24 22:12:...|0.015625|
| null| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I need this:
+--------------------+----------------------+--------+
| timestamp_merged|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
|2016-03-24 22:12:...| 2016-03-24 22:12:...|0.015625|
|2016-03-19 15:54:...| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I tried this, but it did not work:
appendedData = appendedData['timestamp'].fillna(appendedData['toAppendData_timestamp'])
Answer 0 (score: 1)
The function you are looking for is coalesce.
You can import it from pyspark.sql.functions:
from pyspark.sql.functions import coalesce, col
并使用:
# coalesce takes the first non-null value per row, so timestamp is kept when
# present and toAppendData_timestamp fills in the nulls. Assign the result,
# since withColumn returns a new DataFrame.
appendedData = appendedData.withColumn(
    'timestamp_merged',
    coalesce(col('timestamp'), col('toAppendData_timestamp'))
)
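
For completeness, here is a minimal, runnable sketch of the same approach. The SparkSession setup, the sample rows, and the merged name are assumptions added for illustration; only the column names and the coalesce call come from the question and answer above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col

spark = SparkSession.builder.master("local[*]").appName("coalesce_demo").getOrCreate()

# Sample rows shaped like the question's data (timestamps and values are illustrative).
appendedData = spark.createDataFrame(
    [
        ("2016-03-24 22:11:00", None, None),
        (None, "2016-03-24 22:12:00", 0.015625),
        ("2016-03-19 15:55:00", "2016-03-19 15:55:00", 5.78125),
    ],
    ["timestamp", "toAppendData_timestamp", "value"],
)

# coalesce returns the first non-null argument for each row.
merged = appendedData.withColumn(
    "timestamp_merged",
    coalesce(col("timestamp"), col("toAppendData_timestamp")),
)
merged.show(truncate=False)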