Question

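I don't want to use the Databricks API, because we ran into some problems with it.

I want to convert a DataFrame to an RDD, and then write the RDD to a text file, using Java 1.7 and Spark 1.6.2.

I want my DataFrame saved as a text file. I know that the code below works fine if we use Java 1.8: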

df.rdd.map(row => row.mkString("\t")).coalesce(1).saveAsTextFile("outputDirRdd")

But when I try to do the same thing in Java 1.7, I cannot get the syntax right. This is what I have so far:

df.toJavaRDD().map(new Function<???,???>() {
        public ???  call(?? input) throws Exception {

        ?????

        }
    }).coalesce(1).saveAsTextFile("/s/filelocation");

I don't know whether the code above is correct.

Please help me. Thanks in advance.

Answer 1

For this use case, the correct syntax with Java 1.7 and Apache Spark is:

df.toJavaRDD().map(new Function<Row, String>() {
        @Override
        public String call(Row o) throws Exception {
            return o.mkString("\t");
        }
    }).coalesce(1).saveAsTextFile("/s/filelocation");

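Here Row (org.apache.spark.sql.Row) is the input data type and String is the output data type.

The call function takes a Row as its input parameter and returns a String as its output. That is why the signature of call is public String call(Row o) throws Exception {}.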
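To see what call computes for each row without a Spark cluster, here is a minimal plain-Java sketch. It does not use Spark: the RowFunction interface, the joinFields helper, and the MkStringSketch class are hypothetical stand-ins for Function<Row, String>, Row.mkString, and map, and an Object[] stands in for a Row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Spark's Function<T, R>; not a Spark type.
interface RowFunction<T, R> {
    R call(T input) throws Exception;
}

class MkStringSketch {
    // Stand-in for Row.mkString(sep): joins the fields with the separator.
    static String joinFields(Object[] fields, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(String.valueOf(fields[i]));
        }
        return sb.toString();
    }

    // Mimics map(): applies the function to every row, one output line per row.
    static List<String> mapRows(List<Object[]> rows, RowFunction<Object[], String> f) {
        List<String> out = new ArrayList<String>();
        for (Object[] row : rows) {
            try {
                out.add(f.call(row));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return out;
    }

    static List<String> demo() {
        List<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] {1, "alice", 3.5});
        rows.add(new Object[] {2, "bob", null});
        // Java 1.7 style: an anonymous class instead of a lambda.
        return mapRows(rows, new RowFunction<Object[], String>() {
            @Override
            public String call(Object[] row) throws Exception {
                return joinFields(row, "\t");
            }
        });
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

Each input row becomes one tab-separated line, which is the shape of the records that saveAsTextFile then writes out.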

Answer 2

@Synthe This is how the issue is solved.

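The piece of code below ran me into serialization issues: Spark wanted the enclosing class and all of its superclasses to be serializable (presumably because an anonymous inner class keeps a reference to the instance that created it), and a few of those classes I cannot change.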

df.toJavaRDD().map(new Function<Row, String>() {
        public String call(Row v1) throws Exception {
            return v1.mkString("\t");
        }
    }).saveAsTextFile("/s/filelocation");

The workaround is to move the function into a separate class:

df.toJavaRDD().map(new SeprateCls()).saveAsTextFile("/s/filelocation");

The code below defines SeprateCls:

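import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

// Top-level class, so it holds no reference to the driver class.
// Spark's Function interface already extends Serializable.
public class SeprateCls implements Function<Row, String> {

    private static final long serialVersionUID = -635027754589291L;

    @Override
    public String call(Row v1) throws Exception {
        return v1.mkString("\t");
    }

}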
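Why a separate class helps can be reproduced without Spark. The following is a minimal sketch using only java.io, under the assumption that the original failure came from the anonymous class dragging its enclosing object into serialization. All names here (SerializableFunction, StandaloneJoiner, Driver, SerializationCheck) are hypothetical and not part of Spark.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Spark's Function<T, R>, which is also Serializable.
interface SerializableFunction<T, R> extends Serializable {
    R call(T input) throws Exception;
}

// Standalone class: its serialized form contains nothing but itself.
class StandaloneJoiner implements SerializableFunction<String[], String> {
    private static final long serialVersionUID = 1L;

    @Override
    public String call(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(fields[i]);
        }
        return sb.toString();
    }
}

// Deliberately NOT Serializable, like a class that cannot be changed.
class Driver {
    private final String separator = "\t";

    // The anonymous class reads a Driver field, so it keeps a reference
    // to the Driver instance that created it.
    SerializableFunction<String[], String> anonymousJoiner() {
        return new SerializableFunction<String[], String>() {
            @Override
            public String call(String[] fields) {
                return fields.length + Driver.this.separator + "fields";
            }
        };
    }
}

class SerializationCheck {
    // Returns true if Java serialization of the object succeeds.
    static boolean canSerialize(Object o) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(o);
            out.close();
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new StandaloneJoiner()));
        System.out.println(canSerialize(new Driver().anonymousJoiner()));
    }
}
```

The standalone function serializes on its own, while the anonymous one fails because serializing it also means serializing the non-serializable Driver. A static nested class would avoid the hidden reference in the same way as a top-level class.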

Save a DataFrame as a text file without using the Databricks API
