Question

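I am reading a text file in Spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last lines (they could be a header or a trailer). I have found good ways of returning the first and last lines, but no good way of removing them. Is this possible?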

Answer 1

One way of doing this is to zipWithIndex, and then filter out the records with indices 0 and count - 1:

// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()

// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}

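Note that this can be rather costly in terms of performance: if you cache the RDD, you use up memory; if you don't, the RDD is read twice. So if you have any way of identifying these records by their contents (for example, if you know that every record except these contains a certain pattern), using filter would probably be faster.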
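As a rough illustration of the content-based alternative mentioned above, here is a minimal sketch. The "HDR" and "TRL" prefixes, the sample lines, and the names isHeaderOrTrailer and dropByContent are assumptions made up for this example; substitute whatever pattern actually distinguishes your header and trailer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical markers: assume the header starts with "HDR" and the trailer with "TRL".
val HeaderPrefix = "HDR"
val TrailerPrefix = "TRL"

// True for lines that should be dropped.
def isHeaderOrTrailer(line: String): Boolean =
  line.startsWith(HeaderPrefix) || line.startsWith(TrailerPrefix)

// A single pass over the data: no count() and no cache() required.
def dropByContent(lines: RDD[String]): RDD[String] =
  lines.filter(line => !isHeaderOrTrailer(line))

// Local-mode context for trying this out; in spark-shell an `sc` already exists
// and getOrCreate simply returns it.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("drop-header-trailer"))

// Made-up sample data standing in for sc.textFile(fileLocation).
val sample = sc.parallelize(Seq("HDR|2020-01-01", "a,1", "b,2", "TRL|2"), 2)
val body = dropByContent(sample).collect().toSeq
```

Keep in mind that this drops every line matching the pattern, wherever it occurs, so it is only safe when ordinary records can never look like a header or trailer.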

Answer 2

This might be a lighter version:

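val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
// Assumes the first and last partitions are non-empty; otherwise the real
// first/last record sits in a neighbouring partition and is not dropped.
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Checked independently so a single-partition RDD loses both ends
  val noHead = if (idx == 0) iter.drop(1) else iter
  // withPartial(false) discards the short window a 1-element iterator yields,
  // so the last element is dropped in that case too
  if (idx == partitions - 1) noHead.sliding(2).withPartial(false).map(_.head)
  else noHead
}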

scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)

Answer 3

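Here is my take on it. It requires an action (count), but it always gives the expected result and is independent of the number of partitions.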

val rddRowCount = rdd.count()
val rddWithIndices = rdd.zipWithIndex()
val filteredRddWithIndices = rddWithIndices.filter(eachRow =>
  if(eachRow._2 == 0) false
  else if(eachRow._2 == rddRowCount - 1) false
  else true
)
val finalRdd = filteredRddWithIndices.map(eachRow => eachRow._1)
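
The partition-independence claim can be checked with a small experiment. The sketch below wraps the same count plus zipWithIndex logic in a helper (the name dropFirstAndLast is made up for this example, not a Spark API) and runs it on the numbers 1 to 6 split into 1, 3, and 10 partitions; with 10 partitions some of them are empty.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: same idea as the snippet above, generic over the element type.
def dropFirstAndLast[T: ClassTag](rdd: RDD[T]): RDD[T] = {
  val lastIndex = rdd.count() - 1          // action: one pass over the data
  rdd.zipWithIndex()                       // pairs each element with its Long index
    .filter { case (_, idx) => idx != 0 && idx != lastIndex }
    .map { case (value, _) => value }      // discard the index again
}

// Local-mode context for trying this out; in spark-shell an `sc` already exists
// and getOrCreate simply returns it.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("drop-first-last-check"))

// Same input, different partition counts (10 partitions leaves some empty).
val results = Seq(1, 3, 10).map { numPartitions =>
  dropFirstAndLast(sc.parallelize(1 to 6, numPartitions)).collect().toSeq
}
results.foreach(println)
```

Each run should yield 2, 3, 4, 5 regardless of how the data is split, because zipWithIndex assigns indices across the whole RDD rather than per partition. The cost is the extra count() pass, so caching the input beforehand may be worthwhile if it fits in memory.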

Remove the first and last line of an RDD using Spark
