How to replace NULL with 0 in a left outer join in Spark DataFrames v1.6

Asked: 2016-11-23 18:55:31

标签: scala apache-spark spark-dataframe apache-spark-1.6

I am working with Spark v1.6. I have the following two DataFrames, and I want to convert null to 0 in the result set of my left outer join. Any suggestions?

DataFrames

val x: Array[Int] = Array(1, 2, 3)
val df_sample_x = sc.parallelize(x).toDF("x")

val y: Array[Int] = Array(3, 4, 5)
val df_sample_y = sc.parallelize(y).toDF("y")

Left outer join

val df_sample_join = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")

Result set

scala> df_sample_join.show

x | y
1 | null
2 | null
3 | 3

But I want the result set to be displayed like this:

scala> df_sample_join.show

x | y
1 | 0
2 | 0
3 | 3
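For intuition, the transformation the question asks for can be sketched in plain Scala over ordinary collections (a sketch of the semantics only, not the Spark API): a left outer join of x against y on equality, with 0 substituted where there is no match instead of null.

```scala
// Plain-Scala sketch of the desired semantics (not Spark code):
// left outer join x with y on equality; unmatched keys get 0, not null.
val xs = Seq(1, 2, 3)
val ys = Set(3, 4, 5)

val joined = xs.map(k => (k, if (ys.contains(k)) k else 0))
// joined == Seq((1, 0), (2, 0), (3, 3))
```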

3 Answers:

Answer 0 (score: 8)

Just use na.fill:

df.na.fill(0, Seq("y"))
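na.fill(0, Seq("y")) replaces every null in column "y" with 0 and leaves other columns untouched. Its per-row effect can be sketched in plain Scala with rows modelled as maps (an illustration of the behaviour only, not Spark code; the map-based row representation is hypothetical):

```scala
// Rows modelled as Map[String, Any]; fill nulls in column "y" with 0.
val rows: Seq[Map[String, Any]] = Seq(
  Map("x" -> 1, "y" -> null),
  Map("x" -> 2, "y" -> null),
  Map("x" -> 3, "y" -> 3)
)

val filled = rows.map(r => r.updated("y", Option(r("y")).getOrElse(0)))
// filled.map(_("y")) == Seq(0, 0, 3)
```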

Answer 1 (score: 4)

Try:

import org.apache.spark.sql.functions.{coalesce, lit}
val withReplacedNull = df_sample_join.withColumn("y", coalesce('y, lit(0)))

Tested with:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, lit}
import org.apache.spark.sql.types._

val list = List(Row("a", null), Row("b", null), Row("c", 1))
val rdd = sc.parallelize(list)

// "y" must be declared nullable, since the sample rows contain nulls.
val schema = StructType(
    StructField("text", StringType, false) ::
    StructField("y", IntegerType, true) :: Nil)

val df = sqlContext.createDataFrame(rdd, schema)
val df1 = df.withColumn("y", coalesce('y, lit(0)))
df1.show()

Answer 2 (score: 2)

You can fix the existing DataFrame like this:

import org.apache.spark.sql.functions.{when, lit}
val correctedDf = df_sample_join.withColumn("y", when($"y".isNull, lit(0)).otherwise($"y"))

While T.Gawęda's answer also works, I think this is more readable.
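The when($"y".isNull, lit(0)).otherwise($"y") expression is a column-level conditional; per row it is just a null check, which a plain-Scala sketch makes explicit (illustrative only; java.lang.Integer is used because scala.Int cannot hold null):

```scala
// Per-row behaviour of when(isNull, 0).otherwise(y), sketched without Spark.
val values: Seq[Integer] = Seq(null, null, 3)

val corrected = values.map {
  case null => 0          // y IS NULL  -> 0
  case v    => v.intValue // otherwise  -> y unchanged
}
// corrected == Seq(0, 0, 3)
```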