Question

I want to iterate over the contents of one column in a Spark DataFrame and correct the data in a cell whenever it meets a certain condition.

+-------------+
|column_title |
+-------------+
|null         |
|0            |
|1            |
+-------------+

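Say I want to show something else when the column's value is null. I tried using

Column.when() and Dataset.withColumn()

but I couldn't find the right way to do it, and I don't think it should be necessary to convert to an RDD and iterate over it.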

Answer 1

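You can use when with equalTo, or when with isNull. Both snippets assume static imports of col and when from org.apache.spark.sql.functions.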

Dataset<Row> df1 = df.withColumn("value", when(col("value").equalTo("bbb"), "ccc").otherwise(col("value")));

Dataset<Row> df2 = df.withColumn("value", when(col("value").isNull(), "ccc").otherwise(col("value")));

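If you only want to replace null values, you can also use na and fill. Note that fill with a String value only replaces nulls in string columns.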

Dataset<Row> df3 = df.na().fill("ccc");

Answer 2

Another way to do this is with a UDF.

Create the UDF.

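    // Needs org.apache.spark.sql.api.java.UDF1 and StringUtils from Apache Commons Lang
    private static final UDF1<String, String> myUdf = new UDF1<String, String>() {
        @Override
        public String call(final String str) throws Exception {
            // any condition or custom function can be used
            return StringUtils.rightPad(str, 25, 'A');
        }
    };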

Register the UDF with the SparkSession.

    sparkSession.udf().register("myUdf", myUdf, DataTypes.StringType);

Apply the UDF to the dataset.

    // Reassign the existing dataset; the name matches the one used in register()
    dataset = dataset.withColumn("city", functions.callUDF("myUdf", col("city")));
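
The rule inside a UDF is ordinary Java, so it can be written and checked without a Spark session. The following is only a minimal sketch, assuming the goal is the one from the question (show something else when the value is null). The class `NullCells`, the method `replaceNull`, and the placeholder "ccc" are hypothetical names chosen for illustration and are not part of Spark.

```java
// Hypothetical helper holding the per-cell rule; not part of any Spark API.
final class NullCells {
    // Assumed placeholder to show instead of a null cell.
    static final String PLACEHOLDER = "ccc";

    // Returns the placeholder for a null cell and leaves every other value unchanged.
    static String replaceNull(String value) {
        return value == null ? PLACEHOLDER : value;
    }

    public static void main(String[] args) {
        System.out.println(replaceNull(null)); // prints "ccc"
        System.out.println(replaceNull("0"));  // prints "0"
    }
}
```

The call method of the UDF could then return NullCells.replaceNull(str), assuming Spark hands null cells to a Java UDF as a plain Java null.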

Hope this helps!
