Question

After reading a CSV file into a Dataset, I want to remove spaces from the String-typed data using the Java API.

Apache Spark 2.0.0

Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row,String>() {

    @Override
    public String call(Row value) throws Exception {

        return value.getString(0).replace(" ", ""); 
        // But this will remove space from only first column
    }
}, Encoders.STRING());

With MapFunction I cannot remove the spaces from all columns.

In Scala, however, the following works in spark-shell and does exactly what I need.

val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)

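The Dataset opds contains the data without spaces. I want to achieve the same thing in Java, but in the Java API the columns() method returns a String[], and I could not apply this functional style to it on the Dataset.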

Input data

+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+

Expected output data

+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+

Answer 1

Try:

// requires: import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
  dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
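
If you would rather mirror the Scala one-liner than loop with withColumn, one possible approach is to build a SQL expression per column and hand the array to Dataset.selectExpr. The sketch below is plain Java so it runs without Spark; the class StripSpaceExprs and its methods quote and build are hypothetical names introduced here, and it assumes the expressions are later evaluated by Spark SQL, where regexp_replace is available as a SQL function.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: builds one SQL expression per column.
class StripSpaceExprs {

    // Quote a column name with backticks, doubling any backtick inside it
    // (assumed here to be the escape that Spark SQL expects).
    static String quote(String name) {
        return "`" + name.replace("`", "``") + "`";
    }

    // Java counterpart of ds.columns.map(...) in the Scala snippet:
    // String[] has no map method, but Arrays.stream provides one.
    static String[] build(String[] columns) {
        return Arrays.stream(columns)
                .map(c -> "regexp_replace(" + quote(c) + ", ' ', '') AS " + quote(c))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // In a Spark job the input would be dataset.columns(), and the result
        // would be passed on: Dataset<Row> opds = dataset.selectExpr(exprs);
        String[] exprs = build(new String[] {"x", "y", "z", "a", "b"});
        for (String e : exprs) {
            System.out.println(e);
        }
    }
}
```

The same Arrays.stream pattern should also work with Column objects, mapping each name to regexp_replace(col(c), " ", "").alias(c), collecting with toArray(Column[]::new), and passing the array to dataset.select.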

Answer 2

You can try using a regular expression to remove the whitespace inside the strings.

value.getString(0).replaceAll("\\s+", "");

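About \s+: it matches any whitespace character one or more times, as many times as possible (greedy). Use replaceAll rather than replace here, because replaceAll treats its first argument as a regular expression.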
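
To make the difference concrete, here is a small sketch in plain Java. The class name WhitespaceDemo and its two methods are made up for illustration. Note that replace(" ", "") also replaces every occurrence, but only of the literal space character, while replaceAll("\\s+", "") removes tabs and other whitespace as well.

```java
// Hypothetical demo class comparing the two String methods.
class WhitespaceDemo {

    // Literal replacement: removes every ' ' character and nothing else.
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    // Regex replacement: removes every run of whitespace (spaces, tabs, newlines).
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String value = "Hello \t World";
        System.out.println(stripSpaces(value));     // the tab survives
        System.out.println(stripWhitespace(value)); // all whitespace is gone
    }
}
```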

For more on how the two methods differ, see Difference between String replace() and replaceAll().
