Most efficient way to merge timestamp columns in a Spark DataFrame

Asked: 2016-10-21 20:38:46

Tags: apache-spark dataframe pyspark

What is the most efficient way to merge two columns in a Spark DataFrame?

I have two columns with the same meaning. Null values in timestamp should be filled with the value from toAppendData_timestamp.

When both columns have a value, the values are equal...

I have this:

+--------------------+----------------------+--------+
|           timestamp|toAppendData_timestamp|   value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...|                  null|    null|
|                null|  2016-03-24 22:12:...|0.015625|
|                null|  2016-03-19 15:54:...|   5.375|
|2016-03-19 15:55:...|  2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...|                  null|    null|
|2016-03-24 22:11:...|  2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+

I need this:

+--------------------+----------------------+--------+
|    timestamp_merged|toAppendData_timestamp|   value|
+--------------------+----------------------+--------+
|2016-03-24 22:11:...|                  null|    null|
|2016-03-24 22:12:...|  2016-03-24 22:12:...|0.015625|
|2016-03-19 15:54:...|  2016-03-19 15:54:...|   5.375|
|2016-03-19 15:55:...|  2016-03-19 15:55:...| 5.78125|
|2016-03-19 15:56:...|                  null|    null|
|2016-03-24 22:11:...|  2016-03-24 22:11:...| 0.15625|
+--------------------+----------------------+--------+

I tried this, but it didn't work:

appendedData = appendedData['timestamp'].fillna(appendedData['toAppendData_timestamp'])

1 Answer:

Answer 0 (score: 1)

The function you are looking for is coalesce, which returns the first non-null value among its arguments. You can import it from pyspark.sql.functions:

from pyspark.sql.functions import coalesce, col

and use:

appendedData.withColumn(
    'timestamp_merged', 
    coalesce(col('timestamp'), col('toAppendData_timestamp'))
)
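Note that withColumn returns a new DataFrame, so assign the result (e.g. appendedData = appendedData.withColumn(...)). The row-wise semantics of coalesce are simply "take the first non-null value"; a minimal plain-Python sketch of that logic, using made-up sample values from the question's tables (this is an illustration, not Spark code):

```python
def first_non_null(*values):
    """Return the first value that is not None, mimicking SQL COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None

# (timestamp, toAppendData_timestamp) pairs, as in the question's table
rows = [
    ("2016-03-24 22:11:00", None),
    (None, "2016-03-24 22:12:00"),
    ("2016-03-19 15:55:00", "2016-03-19 15:55:00"),
]

# timestamp_merged: null timestamps filled from toAppendData_timestamp
merged = [first_non_null(ts, append_ts) for ts, append_ts in rows]
```

This also hints at why the fillna attempt in the question fails: pyspark's DataFrame.fillna expects a literal replacement value (or a dict of them), not another column, whereas coalesce operates column-against-column.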