Question

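I'm trying to fill missing values in my Spark DataFrame with the previous non-null value (if one exists). I've done this kind of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this something Spark can do? Can it do it for multiple columns? If so, how? If not, any suggestions for alternative approaches within the Hadoop tool suite?

Thanks!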

Answer 1

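By using a Window, I found a solution that works without additional coding. So Jeff was right, there is a solution. The full code is below; I will briefly explain what it does. For more details, see the blog.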

    from pyspark.sql import Window
    from pyspark.sql.functions import last
    import sys

    # define the window
    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

    # define the forward-filled column
    filled_column_temperature = last(df6['temperature'], ignorenulls=True).over(window)

    # do the fill
    spark_df_filled = df6.withColumn('temperature_filled', filled_column_temperature)

So the idea is to define a sliding window that always contains the current row and all previous rows:

    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

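Note that we order by time, so the data is in the correct order. Also note that using "-sys.maxsize" ensures that the window always includes all previous data and keeps growing as we traverse the data from top to bottom, but there might be more efficient solutions.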

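Using the "last" function, we always look at the last row in that window. By passing "ignorenulls=True" we specify that if the current row is null, the function returns the most recent (last) non-null value in the window. Otherwise, the value of the current row is used.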
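To make those semantics concrete, here is a plain-Python sketch (not Spark code) of what "last" with "ignorenulls=True" computes over a window that runs from the first row to the current row. The function name forward_fill and the sample readings are made up for illustration; the input is assumed to be sorted by time already.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Mimic last(col, ignorenulls=True) over rows from the start to the current row.

    `values` is assumed to be sorted by the ordering column (here: time).
    """
    filled = []
    last_seen = None  # most recent non-null value inside the growing window
    for value in values:
        if value is not None:
            last_seen = value  # the current row is non-null, so it wins
        filled.append(last_seen)
    return filled


# Hypothetical temperature readings, already ordered by time.
readings = [None, 20.5, None, None, 22.0, None]
print(forward_fill(readings))
# -> [None, 20.5, 20.5, 20.5, 22.0, 22.0]
```

Note that a null at the very start stays null, because the window contains no earlier non-null value to carry forward.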

Done.

Forward fill missing values in Spark / Python