Question

I am trying to use the map function on a DataFrame in Spark with Java. I am following the documentation, which says:

map(scala.Function1 f, scala.reflect.ClassTag evidence$4) Returns a new RDD by applying a function to all rows of this DataFrame.

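When using Function1 in map, I need to implement all of its methods. I have seen some questions related to this, but the solutions provided convert the DataFrame into an RDD. How can I use the map function on a DataFrame without converting it into an RDD? Also, what is the second parameter of map, i.e. scala.reflect.ClassTag evidence$4?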

I am using Java 7 and Spark 1.6.

Answer 1

I don't think map is the right tool to use on a DataFrame. Maybe you should have a look at the examples in the API. They show how to operate on DataFrames.

Answer 2

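You can use the Dataset directly. There is no need to convert the data you read into an RDD, which is an unnecessary consumption of resources.

dataset.map(mapFunction {...}, encoder); this should be sufficient for your needs.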
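
As a minimal sketch of the dataset.map(mapFunction, encoder) call described above, the following assumes Spark 2.x or later (where a DataFrame is a Dataset<Row> and SparkSession is the entry point) rather than the Spark 1.6 setup from the question. The class name DatasetMapSketch, the column name "name" and the sample values are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class DatasetMapSketch {

  /** Maps every Row of a one-column DataFrame to an upper-cased String. */
  static List<String> upperCaseNames(SparkSession spark, List<String> names) {
    // Build a small DataFrame (Dataset<Row>) with a single column called "name".
    Dataset<Row> df = spark.createDataset(names, Encoders.STRING()).toDF("name");

    // map takes the function plus an Encoder that describes the result type.
    Dataset<String> upper = df.map(
        new MapFunction<Row, String>() {
          @Override
          public String call(Row row) {
            return row.getString(0).toUpperCase(Locale.ROOT);
          }
        },
        Encoders.STRING());

    // Bring the (small) result back to the driver.
    return upper.collectAsList();
  }

  public static void main(String[] args) {
    // local[*] is an assumption for running the sketch on a single machine.
    SparkSession spark = SparkSession.builder()
        .appName("DatasetMapSketch")
        .master("local[*]")
        .getOrCreate();
    System.out.println(upperCaseNames(spark, Arrays.asList("ada", "grace")));
    spark.stop();
  }
}
```

The anonymous MapFunction is used instead of a lambda so that the call is not ambiguous between the Scala and Java overloads of map; the result stays a Dataset and is never converted to an RDD.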

Answer 3

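I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have the map function as part of a class, so you do not need to manipulate Java lambdas.

The call looks like this: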

Dataset<String> dfMap = df.map(
    new CountyFipsExtractorUsingMap(),
    Encoders.STRING());
dfMap.show(5);

The class looks like this:

  /**
   * Returns a substring of the values in the id2 column.
   * 
   * @author jgp
   */
  private final class CountyFipsExtractorUsingMap
      implements MapFunction<Row, String> {
    private static final long serialVersionUID = 26547L;

    @Override
    public String call(Row r) throws Exception {
      String s = r.getAs("id2").toString().substring(2);
      return s;
    }
  }

You can find more details in this example on GitHub.
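
To see what the call method above does to a single value without starting Spark, the same string logic can be exercised in plain Java. This is only a sketch: the class name SubstringCheck and the sample five-character identifiers are hypothetical, since the actual contents of the id2 column are not shown in this answer.

```java
final class SubstringCheck {

  /** Mirrors the body of call(): convert to String, then drop the first two characters. */
  static String dropFirstTwo(Object id2) {
    return id2.toString().substring(2);
  }

  public static void main(String[] args) {
    System.out.println(dropFirstTwo("37001")); // prints 001
    System.out.println(dropFirstTwo(37001));   // prints 001 (an Integer is converted first)
  }
}
```

Note that substring(2) throws a StringIndexOutOfBoundsException for values shorter than two characters, and toString() fails on a null value, so real data may need a guard.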

Answer 4

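Since you have not described a specific problem, note that there are some common alternatives to map on a DataFrame, such as select, selectExpr, and withColumn. If the Spark SQL built-in functions cannot handle your task, you can use a UDF (user-defined function).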
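
The alternatives named above can be sketched roughly as follows, assuming Spark 2.x or later (SparkSession) rather than Spark 1.6. The class name ColumnAlternativesSketch, the UDF name dropPrefix, the column names and the sample values are all hypothetical. The UDF only stands in for custom logic; a built-in substring function would also cover this particular case.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.length;

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

final class ColumnAlternativesSketch {

  /** Adds a built-in-derived column and a UDF-derived column without calling map. */
  static Dataset<Row> withSuffix(SparkSession spark, List<String> ids) {
    // One-column DataFrame; "id2" is a hypothetical column name.
    Dataset<Row> df = spark.createDataset(ids, Encoders.STRING()).toDF("id2");

    // Register a UDF for logic that the built-in functions might not cover.
    spark.udf().register("dropPrefix", new UDF1<String, String>() {
      @Override
      public String call(String s) {
        // Return null for missing or too-short values instead of throwing.
        return (s == null || s.length() < 2) ? null : s.substring(2);
      }
    }, DataTypes.StringType);

    return df
        // withColumn with a built-in function
        .withColumn("id2_length", length(col("id2")))
        // selectExpr with SQL expressions, including the registered UDF
        .selectExpr("id2", "id2_length", "dropPrefix(id2) AS suffix");
  }
}
```

Calling withSuffix(spark, Arrays.asList("37001", "37003")).show() would display the original value, its length, and the value without its first two characters.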

How do I apply the map function to a Spark DataFrame using Java?

4 answers: