Question

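I am loading a file from HDFS into a JavaRDD and want to update that RDD. To do this I am converting it to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), but the conversion fails with a ClassCastException. Basically, I will create key-value pairs and update them by key. IndexedRDD supports updates. Is there any way to convert?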

JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, String>()
    {
        @Override
        public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
            // Split each line at the first space into a key and the rest of the line.
            String[] arr = arg0.split(" ", 2);
            System.out.println("length " + arr.length);
            List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
            // Emit a pair only when the line contains both a key and a value.
            if (arr.length == 2) {
                results.add(new Tuple2<String, String>(arr[0], arr[1]));
            }
            return results;
        }
    });

    IndexedRDD<String,String> test = (IndexedRDD<String,String>) mappedRDD.collectAsMap();

Answer 1

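collectAsMap() returns a java.util.Map containing all the entries of your JavaPairRDD, but that Map no longer has anything to do with Spark. That is, the function collects the values onto a single node (the driver) and hands them back as plain Java. So you cannot cast the result to IndexedRDD or any other RDD type, because it is just an ordinary Map.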
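
The failure can be reproduced without Spark at all. The following is only a minimal sketch: `CastDemo` and `FakeIndexedRdd` are hypothetical names, and `FakeIndexedRdd` is a stand-in for any class unrelated to `java.util.Map`, not the real IndexedRDD. It shows why the cast in the question compiles but then throws at runtime.

```java
import java.util.HashMap;
import java.util.Map;

public class CastDemo {
    // Hypothetical stand-in for a distributed type such as IndexedRDD.
    // It is NOT the real class, only an unrelated, non-final class.
    static class FakeIndexedRdd<K, V> { }

    // Returns true if casting the collected map to FakeIndexedRdd throws.
    @SuppressWarnings("unchecked")
    static boolean castFails(Map<String, String> collected) {
        try {
            // This compiles because Map is an interface and the target class is
            // not final, so the compiler cannot rule the cast out. At runtime the
            // object is a plain HashMap, so the JVM rejects the cast.
            FakeIndexedRdd<String, String> rdd = (FakeIndexedRdd<String, String>) collected;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> collected = new HashMap<>();
        collected.put("key", "value");
        System.out.println("cast fails: " + castFails(collected));
    }
}
```

A cast never converts an object into another type. It only asserts what the object already is, which is why the data has to stay in an RDD and be wrapped rather than collected and cast.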

I haven't used IndexedRDD myself, but from the examples you can see that you need to create it by passing a pair RDD to its constructor:

// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()

So in your code it should be:

IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());

Spark RDD update
