Question

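We have some HDFS files written as Cascading sequence files, and we want to process them with Apache Spark. I tried reading the key-value pairs with a JavaPairRDD, as follows: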

    JavaPairRDD<String, String> input = ctx.sequenceFile("file-path", String.class, String.class);

When I run this job, I get the following error:

    java.io.IOException: Could not find a deserializer for the Key class:
    'cascading.tuple.Tuple'.
    Please ensure that the configuration 'io.serializations' is properly configured,
    if you're using custom serialization.

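I am new to Apache Spark. I have already tried setting the serialization class on the Spark context object, but I still get this error. I have not found a single example of working with Cascading sequence files in Spark. Any help would be appreciated.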

Answer 1

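I found the solution. To deserialize these files, the serialization class has to be set on the Hadoop configuration. It can be done like this: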

    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    ctx.hadoopConfiguration().set("io.serializations", "cascading.tuple.hadoop.TupleSerialization");

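This works because Hadoop reads io.serializations from the Hadoop configuration, not from the Spark configuration, so setting io.serializations on sparkConf has no effect. I hope this helps anyone facing the same problem.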
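One thing to watch: io.serializations is a comma-separated list of class names, and the set(...) call above replaces the whole list with the Cascading entry. If the same job also reads ordinary Writable-based sequence files, it may be safer to append to the existing value instead. Below is a minimal sketch of that idea using only the JDK. SerializationList and append are hypothetical names made up for illustration, not part of Spark, Hadoop, or Cascading, and the "current" value in main is a stand-in for whatever your Hadoop configuration actually returns.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for building an io.serializations value. */
final class SerializationList {

    private SerializationList() {}

    /**
     * Returns the comma-separated list in `existing` with `className` added at the end,
     * keeping the original order and skipping blank entries and duplicates.
     */
    static String append(String existing, String className) {
        Set<String> names = new LinkedHashSet<>();
        if (existing != null) {
            for (String name : existing.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    names.add(trimmed);
                }
            }
        }
        names.add(className.trim());
        return String.join(",", names);
    }

    public static void main(String[] args) {
        // Assumption: stand-in for ctx.hadoopConfiguration().get("io.serializations").
        String current = "org.apache.hadoop.io.serializer.WritableSerialization";
        String updated = append(current, "cascading.tuple.hadoop.TupleSerialization");
        System.out.println(updated);
        // With a real JavaSparkContext named ctx, the value would then be applied with:
        // ctx.hadoopConfiguration().set("io.serializations", updated);
    }
}
```

If the job only ever reads Cascading files, the plain set(...) from the answer is enough and this helper is not needed.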

Reading Cascading sequence files in Spark
