Question

I have built an RDD from a file, where each element of the RDD is a section of the file delimited by a separator.

val inputRDD1: RDD[(String, Long)] = myUtilities.paragraphFile(spark, path1)
                                              .coalesce(100 * spark.defaultParallelism)
                                              .zipWithIndex() // RDD[(String, Long)]
                                              .filter(f => f._2 != 0) // drop the element whose index is 0

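The reason for the last operation above (the filter) is to remove the first element, the one with index 0.

Is there a better way to remove the first element than checking the index value of every element as done above?

Thanks!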

Answer 1

One possibility is to use RDD.mapPartitionsWithIndex and drop the first element from the iterator of the partition with index 0:

val inputRDD = myUtilities
                .paragraphFile(spark, path1)
                .coalesce(100 * spark.defaultParallelism)
                .mapPartitionsWithIndex(
                   (index, it) => if (index == 0) it.drop(1) else it,
                   preservesPartitioning = true
                 )

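This way, you only advance the first partition's iterator by a single item, and all other partitions are left untouched. Is this more efficient? Probably. In any case, I would test both versions to see which one performs better.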
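To make the per-partition behaviour concrete, here is a minimal sketch that imitates mapPartitionsWithIndex on plain Scala collections, without Spark. The object DropFirstSketch, the helper mapPartitionsWithIndexLocal, and the sample data are hypothetical and exist only for illustration; the function passed in is the same one used in the answer. The sketch also shows an assumption the approach relies on: partition 0 must be non-empty, otherwise nothing is dropped.

```scala
object DropFirstSketch {
  // Hypothetical local stand-in for RDD.mapPartitionsWithIndex (not a Spark API):
  // applies f to each partition's iterator together with the partition index.
  def mapPartitionsWithIndexLocal[A, B](partitions: Seq[Seq[A]])(
      f: (Int, Iterator[A]) => Iterator[B]): Seq[List[B]] =
    partitions.zipWithIndex.map { case (part, index) =>
      f(index, part.iterator).toList
    }

  // Made-up sample data: three "partitions", the first starting with the
  // element we want to get rid of.
  val partitions: Seq[Seq[String]] =
    Seq(Seq("header", "p1"), Seq("p2", "p3"), Seq("p4"))

  // Same function as in the answer: only the iterator of partition 0 is advanced.
  val result: Seq[List[String]] =
    mapPartitionsWithIndexLocal(partitions)((index, it) =>
      if (index == 0) it.drop(1) else it)

  // Caveat: if partition 0 happens to be empty, drop(1) removes nothing and
  // the first element survives in a later partition.
  val emptyFirst: Seq[Seq[String]] = Seq(Seq.empty[String], Seq("header", "p1"))

  val resultEmptyFirst: Seq[List[String]] =
    mapPartitionsWithIndexLocal(emptyFirst)((index, it) =>
      if (index == 0) it.drop(1) else it)
}
```

In this sketch, result contains List("p1"), List("p2", "p3") and List("p4"), while resultEmptyFirst still contains "header". Whether partition 0 can be empty on a real RDD depends on how the file is split and coalesced, so it is worth checking before relying on this approach.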

Remove the first element of an RDD without using filter
