What is the most efficient way to merge two columns in a Spark DataFrame?
I have two columns that mean the same thing. Null values in timestamp
should be filled from toAppendData_timestamp.
When both columns have a value, the values are equal...
I have this:
+--------------------+----------------------+--------+
| timestamp|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
| null| 2016-03-24 22:12:...|0.015625|
| null| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I need this:
+--------------------+----------------------+--------+
| timestamp_merged|toAppendData_timestamp| value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...| null| null|
|2016-03-24 22:12:...| 2016-03-24 22:12:...|0.015625|
|2016-03-19 15:54:...| 2016-03-19 15:54:...| 5.375|
|2016-03-19 15:55:...| 2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...| null| null|
|2016-03-24 22:11:...| 2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+
I tried this, but it did not work:
appendedData = appendedData['timestamp'].fillna(appendedData['toAppendData_timestamp'])
Answer 0 (score: 1)
The function you are looking for is coalesce.
You can import it from pyspark.sql.functions:
from pyspark.sql.functions import coalesce, col
并使用:
# coalesce takes the first non-null value per row, so timestamp is kept when
# present and toAppendData_timestamp fills in the nulls. Assign the result,
# since withColumn returns a new DataFrame.
appendedData = appendedData.withColumn(
    'timestamp_merged',
    coalesce(col('timestamp'), col('toAppendData_timestamp'))
)
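
For completeness, here is a minimal, runnable sketch of the same approach. The SparkSession setup, the sample rows, and the merged name are assumptions added for illustration; only the column names and the coalesce call come from the question and answer above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col

spark = SparkSession.builder.master("local[*]").appName("coalesce_demo").getOrCreate()

# Sample rows shaped like the question's data (timestamps and values are illustrative).
appendedData = spark.createDataFrame(
    [
        ("2016-03-24 22:11:00", None, None),
        (None, "2016-03-24 22:12:00", 0.015625),
        ("2016-03-19 15:55:00", "2016-03-19 15:55:00", 5.78125),
    ],
    ["timestamp", "toAppendData_timestamp", "value"],
)

# coalesce returns the first non-null argument for each row.
merged = appendedData.withColumn(
    "timestamp_merged",
    coalesce(col("timestamp"), col("toAppendData_timestamp")),
)
merged.show(truncate=False)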