Randomly replace Spark Dataset column values from a set

Time: 2017-11-21 06:23:24

Tags: apache-spark apache-spark-sql spark-dataframe

I have a Dataset, imputedcsv, and I want to randomly replace the null values in the "Gender" column with either "Male" or "Female".

imputedcsv.groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|  null|   24|
|Female|  240|
|  Male|  242|
+------+-----+

Filling the nulls with a single value works, but how can I fill a column's nulls randomly from a set of values, say {Male, Female}?

imputedcsv.na.fill("Male", Seq("Gender")).groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|Female|  240|
|  Male|  266|
+------+-----+

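I need to fill the nulls randomly with Male or Female, rather than replacing every null with the same single value.

Something like using sample(c('Male','Female')).

For a single value, there is: How to replace null values with a specific value in Dataframe using spark in Java?

Any help is appreciated.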

3 answers:

Answer 0 (score: 1)

If you assume Female and Male are equally likely, you can do the following:

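import org.apache.spark.sql.functions.{coalesce, lit, rand, round, when}

// coalesce keeps non-null values; nulls get "Male" or "Female" from a per-row coin flip
df.withColumn("gender",
    coalesce($"gender",
             when(round(rand()).cast("int") === lit(0), lit("Male"))
               .otherwise(lit("Female"))
    )).show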

coalesce makes this apply only to null values. round(rand).cast("int") produces 0 or 1 for each row, and the when/otherwise construct then picks Male or Female accordingly.
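The coin flip above covers exactly two values. A minimal sketch of how the same coalesce idea might extend to a set of any size: build an array column from the candidates and index it with a per-row random integer. The sample data and the names choices and randomChoice are hypothetical, and the snippet assumes Spark SQL is on the classpath (for example, pasted into spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, coalesce, floor, lit, rand}

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("random-fill").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for imputedcsv.
val df = Seq[(Int, Option[String])](
  (1, Some("Male")), (2, None), (3, Some("Female")), (4, None), (5, None)
).toDF("id", "Gender")

// Hypothetical candidate values; any number of entries works.
val choices = Seq("Male", "Female")

// floor(rand() * n) is a per-row integer in [0, n); it indexes the array of candidates.
val randomChoice = array(choices.map(lit): _*)(floor(rand() * choices.size).cast("int"))

// coalesce keeps existing values and falls back to the random pick only for nulls.
val filled = df.withColumn("Gender", coalesce($"Gender", randomChoice))
filled.groupBy("Gender").count().show()
```

Because rand() is evaluated per row, the counts will differ from run to run. Passing a seed, as in rand(42), should make the fill repeatable for a fixed input and partitioning.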

Answer 1 (score: 0)

You can use when/otherwise together with withColumn to achieve this, as shown below:

scala> df.groupBy("Gender").count.show

+------+-----+
|Gender|count|
+------+-----+
|  null|    2|
|female|    4|
|  male|    4|
+------+-----+

scala> df.withColumn("gender", when(($"gender".isNull), "male").otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+

I missed the "randomly" part. You can implement it like this:

scala> val gender_set = Set("male","female")
gender_set: scala.collection.immutable.Set[String] = Set(male, female)

scala> import scala.util.Random
import scala.util.Random

scala>  val rnd=new Random
rnd: scala.util.Random = scala.util.Random@668b5a55

scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+


scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    6|
|  male|    4|
+------+-----+

Thanks.
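Note that in both runs above, every null row received the same value (all male in the first run, all female in the second). This is consistent with how the expression is built: gender_set.toVector(rnd.nextInt(gender_set.size)) is ordinary Scala that runs once on the driver, and when then wraps the resulting string as a constant. A minimal sketch of that behavior, using hypothetical sample data and the hypothetical names genderSet, picked, and valuesForNulls:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when
import scala.util.Random

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("single-draw").getOrCreate()
import spark.implicits._

// Hypothetical data: rows 2 to 5 have a null gender.
val df = Seq[(Int, Option[String])](
  (1, Some("male")), (2, None), (3, None), (4, None), (5, None), (6, Some("female"))
).toDF("id", "gender")

val genderSet = Set("male", "female")
val rnd = new Random

// Evaluated once, here on the driver: picked is a plain String, not a per-row expression.
val picked: String = genderSet.toVector(rnd.nextInt(genderSet.size))

// when(...) receives that String and treats it as a literal for every matching row.
val filled = df.withColumn("gender", when($"gender".isNull, picked).otherwise($"gender"))

// Collect the values assigned to the originally-null rows.
val valuesForNulls = filled
  .filter($"id".isin(2, 3, 4, 5))
  .select("gender").as[String]
  .collect().toSet

println(valuesForNulls) // a set with a single element, whichever value was drawn
```

To get a different draw per row, the random choice has to be part of the column expression itself, for example through rand() or a UDF.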

Answer 2 (score: 0)

I had to put @Learner's code inside a UDF for it to work; otherwise it gave an error.

df.groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|    3|
|Female|    3|
|  Male|    2|
+------+-----+
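import org.apache.spark.sql.functions.{udf, when}
import scala.util.Random

val rnd = new Random  // same generator as in the previous answer
val gender_set = Set("Male","Female")

// The UDF body runs once per row, so each null can receive a different value
val randGenderUDF = udf(() => 
   gender_set.toVector(rnd.nextInt(gender_set.size))
)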

df.withColumn("Gender", when($"Gender".isNull, randGenderUDF()).otherwise($"Gender")).groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|Female|    5|
|  Male|    3|
+------+-----+
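One caveat with the UDF approach: Spark treats Scala UDFs as deterministic unless told otherwise, so the optimizer is free to assume repeated calls return the same result. A cautious variant, sketched below, marks the UDF with asNondeterministic() (available on UserDefinedFunction since Spark 2.3). The sample data and the names genders and randGender are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, when}
import scala.util.Random

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("udf-fill").getOrCreate()
import spark.implicits._

// Hypothetical sample data with three null rows, as in the answer above.
val df = Seq[(Int, Option[String])](
  (1, Some("Male")), (2, None), (3, Some("Female")), (4, None),
  (5, Some("Female")), (6, None), (7, Some("Male")), (8, Some("Female"))
).toDF("id", "Gender")

val genders = Vector("Male", "Female")

// Random.nextInt is called inside the function body, so it runs once per row.
// asNondeterministic() tells the optimizer not to assume repeatable results.
val randGender = udf(() => genders(Random.nextInt(genders.size))).asNondeterministic()

val filled = df.withColumn(
  "Gender",
  when($"Gender".isNull, randGender()).otherwise($"Gender")
)
filled.groupBy($"Gender").count().show()
```

Each action on filled recomputes the random draws, so caching or writing the result out is worth considering if the filled values need to stay stable across later steps.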