Question

I have a dataset like this:

id    category     value
1     A            NaN
2     B            NaN
3     A            10.5
5     A            2.0
6     B            1.0

I want to fill the NaN values with the mean of their respective category, as shown below:

id    category     value
1     A            4.16
2     B            0.5
3     A            10.5
5     A            2.0
6     B            1.0

I tried using group by, first computing the mean of each category.

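This gave me a map from each category to its mean:

val df2 = dataFrame.groupBy(category).agg(mean(value)).rdd.map {
  case r: Row => (r.getAs[String](category), r.get(1))
}.collect().toMap
println(df2)

I then tried an update query in Spark SQL to fill the column, but it seems Spark SQL doesn't support update queries. I also tried to fill the null values in the DataFrame directly, but failed. What can I do? The same thing can be done in pandas, as shown in "Pandas: How to fill null values with mean of a groupby?", but how can I do it with a Spark DataFrame?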

Answer 1

Indeed, you cannot update DataFrames, but you can transform them using functions like select and join. In this case, you can keep the grouping result as a DataFrame and join it (on the category column) to the original one, then perform the mapping that replaces NaNs with the mean values:

import org.apache.spark.sql.functions._
import spark.implicits._

// calculate mean per category, skipping NaNs (a NaN input would make the mean NaN):
val meanPerCategory = dataFrame
  .filter(!isnan($"value"))
  .groupBy("category")
  .agg(mean("value") as "mean")

// use join, select and the "nanvl" function to replace NaNs with the mean values:
val result = dataFrame
  .join(meanPerCategory, "category")
  .select($"category", $"id", nanvl($"value", $"mean") as "value")

result.show()

Answer 2

The simplest solution is to use groupBy and join:

 val df2 = df.filter(!(isnan($"value"))).groupBy("category").agg(avg($"value").as("avg"))
 df.join(df2, "category").withColumn("value", when(col("value").isNaN, $"avg").otherwise($"value")).drop("avg")

Note that if a category contains only NaN values, it will be dropped from the result.

Answer 3

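I ran into the same problem and came across this post, but I tried a different solution based on window functions. The code below was tested on pyspark 2.4.3 (window functions have been available since Spark 1.4). I believe this is a cleaner solution. The post is old, but I hope this answer helps others.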

from pyspark.sql import Window
from pyspark.sql.functions import coalesce, lit, mean

# `spark` is the active SparkSession (predefined in the pyspark shell)
df = spark.createDataFrame([(1,"A", None), (2,"B", None), (3,"A",10.5), (5,"A",2.0), (6,"B",1.0)], ['id', 'category', 'value'])

category_window = Window.partitionBy("category")
# value0 replaces a missing value with 0, so missing rows still count towards the mean
value_mean = mean("value0").over(category_window)

result = df\
  .withColumn("value0", coalesce("value", lit(0)))\
  .withColumn("value_mean", value_mean)\
  .withColumn("new_value", coalesce("value", "value_mean"))\
  .select("id", "category", "new_value")

result.show()

The output matches the expected result from the question:

id  category  new_value
1   A         4.166666666666667
2   B         0.5
3   A         10.5
5   A         2
6   B         1
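One detail worth spelling out: 4.1667 for category A is not the mean of the known values (10.5 and 2.0 average to 6.25). It arises because value0 replaces each missing value with 0 before averaging, which is also how the expected output in the question appears to have been computed. A quick check in plain Python, using only the sample values from the question (the helper names are made up for illustration):

```python
from statistics import mean

# Sample values per category from the question; None marks a missing value.
values = {"A": [None, 10.5, 2.0], "B": [None, 1.0]}

def mean_missing_as_zero(xs):
    """Mean where each missing entry counts as 0 (what value0 does)."""
    return mean(0.0 if x is None else x for x in xs)

def mean_ignoring_missing(xs):
    """Mean over the known entries only."""
    return mean(x for x in xs if x is not None)

as_zero = {k: mean_missing_as_zero(v) for k, v in values.items()}
ignored = {k: mean_ignoring_missing(v) for k, v in values.items()}
print(as_zero)
print(ignored)
```

To average only the known values in the Spark version, the window mean can be taken over value instead of value0, since Spark's mean skips nulls.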

Replace null values of a column with the mean of another categorical column in a Spark DataFrame
