
Question

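I have a CSV that contains spatial (latitude, longitude) and temporal (timestamp) data.

To make it useful for us, we convert the spatial information to a "geohash" and the temporal information to a "timehash".

The question is: how can I add the geohash and timehash as fields to every row of the CSV using Spark (the data is about 200 GB)?

We tried using JavaPairRDD and its mapToPair function, but that leaves the problem of how to convert back to a JavaRDD and then to CSV. I think that was a poor solution, so I am asking for a simpler way.

Question update:

After @Alvaro's help, I created this Java class: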

public class Hash {
    public static SparkConf Spark_Config;
    public static JavaSparkContext Spark_Context;

    UDF2 geohashConverter = new UDF2<Long, Long, String>() {

        public String call(Long latitude, Long longitude) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    UDF1 timehashConverter = new UDF1<Long, String>() {

        public String call(Long timestamp) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    public Hash(String path) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        Dataset df=spark.read().csv(path)
                .withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1")))
                .write().csv("C:/Users/Ahmed/Desktop/preprocess2");
    }

    public static void main(String[] args) {
        String path = "C:/Users/Ahmed/Desktop/cabs_trajectories/cabs_trajectories/green/2013";
        Hash h = new Hash(path);
    }
}

Then I get a serialization problem, which disappears when I remove write().csv().

Answer 1

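One of the most efficient approaches is to load the CSV with the Dataset API and transform the columns you specify with user-defined functions (UDFs). That way your data stays structured throughout and you never have to deal with tuples.

First, create your user-defined functions: geohashConverter, which takes two values (latitude and longitude), and timehashConverter, which takes only the timestamp.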

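import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Latitude and longitude are fractional degrees, hence Double.
UDF2<Double, Double, String> geohashConverter = new UDF2<Double, Double, String>() {
    @Override
    public String call(Double latitude, Double longitude) throws Exception {
        // convert here
        return "calculate_hash";
    }
};

// Assumes the timestamp column holds numeric epoch values.
UDF1<Long, String> timehashConverter = new UDF1<Long, String>() {
    @Override
    public String call(Long timestamp) throws Exception {
        // convert here
        return "calculate_hash";
    }
};

Once they are created, you must register them:

spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

Then read the CSV file and apply the user-defined functions by calling withColumn. It creates a new column from the UDF you invoke with callUDF. callUDF always takes a string with the name of the registered UDF you want to call, followed by one or more columns whose values are passed to the UDF.

Finally, call write().csv("path") to save your Dataset:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;

spark.read()
        // Read the header row so that columns can be referenced by name.
        .option("header", "true")
        .csv("/source/path")
        // CSV columns are read as strings by default, so cast each one
        // to the parameter type that the UDF declares.
        .withColumn("geohash", callUDF("geohashConverter",
                col("latitude").cast("double"), col("longitude").cast("double")))
        .withColumn("timehash", callUDF("timehashConverter", col("timestamp").cast("long")))
        .write().csv("/path/to/save");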
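
For the `// convert here` placeholder in geohashConverter, a minimal sketch of the standard base-32 geohash encoding could look like the following. The class name GeoHashSketch, the method encode, and the precision parameter are illustrative names that do not come from the answer, and in a real job an established geohash library may be the better choice.

```java
public final class GeoHashSketch {
    // Base-32 alphabet used by geohash (no a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    private GeoHashSketch() {
    }

    /** Encodes a coordinate as a geohash with the given number of characters. */
    public static String encode(double latitude, double longitude, int precision) {
        double latMin = -90.0, latMax = 90.0;
        double lonMin = -180.0, lonMax = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLongitude = true; // bits alternate, starting with longitude
        int bitCount = 0;            // bits collected for the current character
        int value = 0;               // value of the current 5-bit group

        while (hash.length() < precision) {
            if (useLongitude) {
                double mid = (lonMin + lonMax) / 2;
                if (longitude >= mid) {
                    value = (value << 1) | 1;
                    lonMin = mid;
                } else {
                    value = value << 1;
                    lonMax = mid;
                }
            } else {
                double mid = (latMin + latMax) / 2;
                if (latitude >= mid) {
                    value = (value << 1) | 1;
                    latMin = mid;
                } else {
                    value = value << 1;
                    latMax = mid;
                }
            }
            useLongitude = !useLongitude;
            bitCount++;
            if (bitCount == 5) {
                // Every 5 bits select one character of the alphabet.
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(42.6, -5.6, 5));          // prints "ezs42"
        System.out.println(encode(57.64911, 10.40744, 5));  // prints "u4pru"
    }
}
```

Inside the UDF you would then return GeoHashSketch.encode(latitude, longitude, precision) with whatever precision suits your data. Because the method is static, no helper object has to be shipped along with the UDF.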

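Hope it helps!

Update

It would be very helpful if you posted the code that causes the problem, because the exception says almost nothing about which part of the code is not serializable.

In any case, from my own experience with Spark, I think the problem is the object you use to calculate the hash. Remember that this object has to be distributed across the cluster. If it cannot be serialized, Spark throws a "Task not serializable" exception. You have two ways to solve this:

Implement the Serializable interface in the class used to calculate the hash.
Create a static method that produces the hash, and call that method from the UDF.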
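
The second option might look like the sketch below. The question does not define "timehash", so as a stand-in this helper simply formats an epoch-seconds timestamp as a UTC hour bucket. The class HashFunctions and the method timeBucket are hypothetical names, and the bucket format is an assumption, not the asker's actual hashing scheme.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public final class HashFunctions {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is enough.
    // Static state is not serialized with a UDF; each JVM initializes its own copy.
    private static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    private HashFunctions() {
    }

    /** Assumed stand-in for a "timehash": the UTC hour that contains the timestamp. */
    public static String timeBucket(long epochSeconds) {
        return HOUR_BUCKET.format(Instant.ofEpochSecond(epochSeconds));
    }

    public static void main(String[] args) {
        System.out.println(timeBucket(0L));     // prints "1970010100"
        System.out.println(timeBucket(90000L)); // prints "1970010201"
    }
}
```

A UDF body would then only contain a call such as return HashFunctions.timeBucket(timestamp), so the UDF itself holds no reference to any object that would need to be serialized.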

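Update 2

Then I get a serialization problem, which disappears when I remove write().csv()

This is the expected behavior. When you remove write().csv(), nothing gets executed. You should know how Spark works: in this code, every method called before csv() is a transformation. In Spark, transformations run only once an action such as csv(), show() or count() is called.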
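
As a Spark-free analogy, plain Java streams defer work in a similar way: map() only describes the work, and nothing runs until a terminal operation is called. The sketch below is not Spark code, and the names LazyDemo and callsBeforeAndAfter are made up for illustration. It counts how often the mapping function has run before and after the terminal collect().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    /** Returns how often the mapping function ran before and after the terminal operation. */
    static int[] callsBeforeAndAfter() {
        AtomicInteger calls = new AtomicInteger();
        List<String> rows = Arrays.asList("row1", "row2", "row3");

        // Comparable to a transformation: this only describes the work.
        Stream<String> pipeline = rows.stream().map(row -> {
            calls.incrementAndGet();
            return row + ",calculate_hash";
        });
        int before = calls.get();

        // Comparable to an action: this is what triggers the mapping function.
        List<String> result = pipeline.collect(Collectors.toList());
        int after = calls.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = callsBeforeAndAfter();
        System.out.println(counts[0] + " " + counts[1]); // prints "0 3"
    }
}
```

In the same way, a failure that only occurs while the work is being executed cannot show up as long as no action is called.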

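The problem is that you are creating and running the Spark job inside a non-serializable class (and, even worse, in its constructor!).

Creating the Spark job in a static method solves the problem. Remember that your Spark code has to be distributed across the cluster, so it must be serializable. It worked for me, so it should work for you: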

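import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class Hash {
    public static void main(String[] args) {
        String path = "in/prueba.csv";

        // Declared inside a static method, so the anonymous classes do not
        // hold a reference to a (non-serializable) Hash instance.
        UDF2<Double, Double, String> geohashConverter = new UDF2<Double, Double, String>() {
            @Override
            public String call(Double latitude, Double longitude) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        // Assumes the timestamp column holds numeric epoch values.
        UDF1<Long, String> timehashConverter = new UDF1<Long, String>() {
            @Override
            public String call(Long timestamp) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        // No header option here: the columns get the default names _c0, _c1, ...
        // If your file has a header row, set option("header", "true") and use
        // the header names instead of _c6, _c7 and _c1.
        spark
                .read()
                .format("com.databricks.spark.csv")
                .load(path)
                // CSV columns are read as strings, so cast them to the UDF parameter types.
                .withColumn("geohash", callUDF("geohashConverter",
                        col("_c6").cast("double"), col("_c7").cast("double")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1").cast("long")))
                .write().csv("resultados");
    }
}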
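
To see why the static context matters, here is a small Spark-free sketch. Fn is a hypothetical stand-in for Spark's UDF1 (which is also Serializable), and CaptureDemo plays the role of the non-serializable Hash class. The anonymous class created as an instance field reads a field of its enclosing object, so it has to carry a reference to that object. In the question's code the reference is implicit, but the effect should be the same.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Hypothetical stand-in for a Spark UDF interface.
    interface Fn extends Serializable {
        String call(Long value);
    }

    private final String prefix = "hash-";

    // Created in an instance context: it reads the prefix field, so it keeps a
    // reference to the enclosing CaptureDemo, which is not Serializable.
    final Fn instanceFn = new Fn() {
        @Override
        public String call(Long value) {
            return prefix + value;
        }
    };

    // Created in a static context: there is no enclosing instance to capture.
    static Fn staticFn() {
        return new Fn() {
            @Override
            public String call(Long value) {
                return "hash-" + value;
            }
        };
    }

    /** Tries Java serialization and reports whether it succeeded. */
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new CaptureDemo().instanceFn)); // prints "false"
        System.out.println(canSerialize(staticFn()));                   // prints "true"
    }
}
```

Spark has to serialize the UDFs in order to ship them to the executors, which is why moving their creation into the static main method avoids the exception.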

Adding fields to a CSV with Spark
