I am working with Spark v1.6. I have the following two DataFrames, and I want to convert the nulls in my left outer join result set to 0. Any suggestions?
val x: Array[Int] = Array(1, 2, 3)
val df_sample_x = sc.parallelize(x).toDF("x")

val y: Array[Int] = Array(3, 4, 5)
val df_sample_y = sc.parallelize(y).toDF("y")

val df_sample_join = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")
scala> df_sample_join.show
+---+----+
|  x|   y|
+---+----+
|  1|null|
|  2|null|
|  3|   3|
+---+----+
The output I want instead is:

scala> df_sample_join.show
+---+---+
|  x|  y|
+---+---+
|  1|  0|
|  2|  0|
|  3|  3|
+---+---+
Answer 0 (score: 8)
Just use na.fill:

df_sample_join.na.fill(0, Seq("y"))
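As a fuller sketch (assuming the Spark 1.6 shell from the question, where `sc` is available and `df_sample_join` has been built as shown above), `na.fill` can also fill every numeric column at once, or take a per-column `Map` of defaults:

```scala
// Fill nulls only in column "y" with 0 (the case in the question)
val filledY = df_sample_join.na.fill(0, Seq("y"))

// Fill nulls in every numeric column with 0
val filledAll = df_sample_join.na.fill(0)

// Per-column defaults via a Map (column name -> replacement value)
val filledMap = df_sample_join.na.fill(Map("y" -> 0))

filledY.show()
```

Note that `na.fill` only replaces nulls in columns whose type matches the supplied value, so filling with an `Int`/`Long` leaves string columns untouched.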
Answer 1 (score: 4)
Try:

import org.apache.spark.sql.functions.{coalesce, lit}

val withReplacedNull = df_sample_join.withColumn("y", coalesce('y, lit(0)))
Tested with:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, lit}
import org.apache.spark.sql.types._

val list = List(Row("a", null), Row("b", null), Row("c", 1))
val rdd = sc.parallelize(list)
val schema = StructType(
  StructField("text", StringType, false) ::
  StructField("y", IntegerType, true) :: Nil)  // "y" must be nullable, since it holds nulls
val df = sqlContext.createDataFrame(rdd, schema)
val df1 = df.withColumn("y", coalesce('y, lit(0)))
df1.show()
Answer 2 (score: 2)
You can fix the existing DataFrame like this:

import org.apache.spark.sql.functions.{when, lit}
import sqlContext.implicits._  // for the $"y" column syntax

val correctedDf = df_sample_join.withColumn("y", when($"y".isNull, lit(0)).otherwise($"y"))

While T.Gawęda's answer also works, I think this is more readable.