I have a dataset, imputedcsv, and I want to randomly replace the null values in its "Gender" column with either "Male" or "Female".
imputedcsv.groupBy("Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|   24|
|Female|  240|
|  Male|  242|
+------+-----+
I can fill the nulls with a single value, but how do I fill a column's nulls randomly from a set of values, say {Male, Female}?
imputedcsv.na.fill("Male", Seq("Gender")).groupBy("Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|Female|  240|
|  Male|  266|
+------+-----+
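I need to fill the nulls randomly with Male or Female, rather than replacing every null with the single value Male.
Something like sample(c('Male','Female')), as one would write in R.
For a single value, there is: How to replace null values with a specific value in Dataframe using spark in Java?
Any help is appreciated.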
Answer 0: (score: 1)
If you assume that Female and Male are equally likely, you can do the following:
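import org.apache.spark.sql.functions.{coalesce, lit, rand, round, when}
df.withColumn("gender",
  // coalesce keeps existing values and only falls through to the random pick for nulls
  coalesce($"gender",
    // round(rand()) is 0 or 1 with equal probability, evaluated per row
    when(round(rand()).cast("int") === lit(0), lit("Male"))
      .otherwise(lit("Female"))
  )).show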
coalesce makes this apply only to null values.
round(rand()).cast("int") yields 0 or 1 for each row, and the when - otherwise construct then maps that number to Male or Female.
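If the set has more than two values, the same idea can probably be extended by turning rand() into an array index. This is a minimal sketch, assuming Spark's built-in array, floor and rand functions. The object name RandomFillSketch, the helpers randomChoice and fillRandom, and the sample data are made up for illustration and are not part of the answer above:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, coalesce, col, floor, lit, rand}

object RandomFillSketch {
  // Hypothetical helper: a column expression that picks one of `choices`
  // uniformly at random for every row.
  def randomChoice(choices: Seq[String]): Column = {
    // rand() is uniform in [0, 1), so floor(rand() * n) is a 0-based index in 0..n-1.
    val idx = floor(rand() * choices.size).cast("int")
    array(choices.map(c => lit(c)): _*)(idx)
  }

  // Hypothetical helper: replaces only the nulls in `column` with a random choice.
  def fillRandom(df: DataFrame, column: String, choices: Seq[String]): DataFrame =
    df.withColumn(column, coalesce(col(column), randomChoice(choices)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("random-fill-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the question's dataset.
    val df = Seq[String]("Male", null, "Female", null, null).toDF("Gender")

    fillRandom(df, "Gender", Seq("Male", "Female"))
      .groupBy("Gender").count().show()

    spark.stop()
  }
}
```

Because rand() is non-deterministic, the filled values can change each time the plan is recomputed. Passing a seed, as in rand(42L), or caching the result may help if you need stable output.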
Answer 1: (score: 0)
You can achieve this with when & otherwise and withColumn, as shown below:
scala> df.groupBy("Gender").count.show
+------+-----+
|Gender|count|
+------+-----+
|  null|    2|
|female|    4|
|  male|    4|
+------+-----+
scala> df.withColumn("gender", when(($"gender".isNull), "male").otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+
I missed the "randomly" part; you can implement it like this:
scala> val gender_set = Set("male","female")
gender_set: scala.collection.immutable.Set[String] = Set(male, female)
scala> import scala.util.Random
import scala.util.Random
scala> val rnd=new Random
rnd: scala.util.Random = scala.util.Random@668b5a55
scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+
scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    6|
|  male|    4|
+------+-----+
Thanks.
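One thing worth checking in the snippet above: gender_set.toVector(rnd.nextInt(gender_set.size)) is an ordinary Scala expression, so it is evaluated once on the driver before when(...) receives it, and the resulting string is then used as a literal for every null row. That is consistent with the two runs shown, where both nulls got the same value each time (male 6, then female 6). Here is a minimal sketch contrasting a driver-side pick with a per-row pick built from Spark's rand(). The object name DriverVsRowRandom, its two helper methods, and the all-null sample data are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, rand, when}
import scala.util.Random

object DriverVsRowRandom {
  private val choices = Vector("male", "female")
  private val rnd = new Random

  // The random pick happens here, once, on the driver. Spark only ever sees
  // a constant string, so every null row is filled with the same value.
  def fillWithDriverPick(df: DataFrame): DataFrame = {
    val pickedOnce: String = choices(rnd.nextInt(choices.size))
    df.withColumn("gender",
      when(col("gender").isNull, lit(pickedOnce)).otherwise(col("gender")))
  }

  // rand() is a Spark expression, so it is evaluated for each row.
  def fillPerRow(df: DataFrame): DataFrame =
    df.withColumn("gender",
      when(col("gender").isNull,
        when(rand() < 0.5, lit("male")).otherwise(lit("female")))
        .otherwise(col("gender")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("driver-vs-row")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: 1000 rows whose gender is null.
    val df = Seq.fill(1000)(null: String).toDF("gender")

    // Only one distinct value appears, whichever was picked on the driver.
    fillWithDriverPick(df).groupBy("gender").count().show()

    // Both values should normally appear, in roughly equal numbers.
    fillPerRow(df).groupBy("gender").count().show()

    spark.stop()
  }
}
```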
Answer 2: (score: 0)
I needed to put @Learner's code in a UDF to make it work; otherwise it throws an error.
df.groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|    3|
|Female|    3|
|  Male|    2|
+------+-----+
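import org.apache.spark.sql.functions.{udf, when}
import scala.util.Random
val rnd = new Random  // from the previous answer; repeated so this snippet stands alone
val gender_set = Set("Male","Female")
// Wrapping the pick in a UDF means it is evaluated per row rather than once up front
val randGenderUDF = udf(() =>
  gender_set.toVector(rnd.nextInt(gender_set.size))
)
df.withColumn("Gender", when($"Gender".isNull, randGenderUDF()).otherwise($"Gender")).groupBy($"Gender").count.show()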
+------+-----+
|Gender|count|
+------+-----+
|Female|    5|
|  Male|    3|
+------+-----+
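A possible refinement of the UDF approach: Spark assumes UDFs are deterministic unless told otherwise, so it may be safer to mark a random UDF with asNondeterministic() (available on UserDefinedFunction since Spark 2.3). The sketch below also swaps scala.util.Random for java.util.concurrent.ThreadLocalRandom, which is my own substitution so that no Random instance has to be captured by the closure. The object name RandomGenderUdfSketch, the fill helper, and the sample data are hypothetical:

```scala
import java.util.concurrent.ThreadLocalRandom
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, when}

object RandomGenderUdfSketch {
  private val genders = Vector("Male", "Female")

  // Zero-argument UDF that returns a random element of `genders` on each call.
  // asNondeterministic() tells the optimizer not to assume that repeated
  // calls return the same result.
  val randGender = udf(() =>
    genders(ThreadLocalRandom.current().nextInt(genders.size))
  ).asNondeterministic()

  // Hypothetical helper: fills only the null entries of the Gender column.
  def fill(df: DataFrame): DataFrame =
    df.withColumn("Gender",
      when(col("Gender").isNull, randGender()).otherwise(col("Gender")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("random-gender-udf")
      .getOrCreate()
    import spark.implicits._

    // Made-up data shaped like the example above: 3 nulls, 3 Female, 2 Male.
    val df = Seq[String](null, null, null, "Female", "Female", "Female", "Male", "Male")
      .toDF("Gender")

    fill(df).groupBy("Gender").count().show()

    spark.stop()
  }
}
```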