Question

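Problem at hand: I wrote an attempt at an improved bigram generator that works over lines and takes full stops and the like into account. The results are as wanted. It does not use mapPartitions; the code is shown below.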

import org.apache.spark.mllib.rdd.RDDFunctions._

val wordsRdd = sc.textFile("/FileStore/tables/natew5kh1478347610918/NGram_File.txt", 10)

// Split lines into lower-cased tokens; a trailing "." marks the end of a sentence.
val wordsRDDTextSplit = wordsRdd
  .flatMap(line => line.trim.split(" "))
  .map(_.toLowerCase)
  .map(_.replaceAll(",{1,}", ""))     // drop commas
  .map(_.replaceAll("!{1,}", "."))    // "!" ends a sentence
  .map(_.replaceAll("\\?{1,}", "."))  // "?" ends a sentence
  .map(_.replaceAll("\\.{1,}", "."))  // collapse repeated full stops
  .map(_.replaceAll("\\W+", "."))     // any remaining non-word run becomes "."
  .filter(_ != ".")
  .filter(_ != "")

val x = wordsRDDTextSplit.collect() // need to do this due to lazy evaluation etc. I think, need collect()
val y = for (Array(a, b, _*) <- x.sliding(2).toArray) yield (a, b)
// Drop pairs that start on a sentence-final word, then strip the full stops.
val z = y.filter(x => !(x._1 contains ".")).map(x => (x._1.replaceAll("\\.{1,}", ""), x._2.replaceAll("\\.{1,}", "")))

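I have some questions:

The results are as expected and no data is missed. But can I convert this approach to a mapPartitions approach? Would I not lose some data? Many say that this happens because each partition being processed holds only a subset of the words, so the relationship at a split boundary (the next and the previous word) is missed. With large file splits I can see that, from the map point of view, this could occur as well. Correct?
However, if you look at the code above (no mapPartitions attempt), it always works regardless of how much I parallelize it, whether 10 or 100 partitions are specified, even when consecutive words sit in different partitions. I checked this with mapPartitionsWithIndex. This is what I am not clear on. OK, a reduce on (x, y) => x + y is well understood.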
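
The boundary concern raised above can be reproduced without Spark. The following is a minimal sketch, not the code from the question: the object name `BoundaryLossSketch`, the helper `partitionsOf`, and the sample words are hypothetical, and consecutive chunks of a local sequence stand in for RDD partitions. It compares sliding over the whole sequence with sliding inside each chunk.

```scala
object BoundaryLossSketch {
  // Bigrams over one local sequence, in the spirit of x.sliding(2) in the question.
  def bigrams(words: Seq[String]): Seq[(String, String)] =
    words.sliding(2).collect { case Seq(a, b) => (a, b) }.toList

  // Hypothetical stand-in for partitions: consecutive chunks of the word list.
  def partitionsOf(words: Seq[String], size: Int): Seq[Seq[String]] =
    words.grouped(size).toList

  def main(args: Array[String]): Unit = {
    val words = Seq("hello", "how", "are", "you", "today")
    val global = bigrams(words)                                // slide over everything
    val perPartition = partitionsOf(words, 2).flatMap(bigrams) // slide inside each chunk
    println(global)
    println(perPartition)
    println(global.diff(perPartition)) // pairs that straddle a chunk boundary
  }
}
```

Under these assumptions, the pairs that span two chunks appear only in the first result, which is the loss that a naive per-partition window would be expected to show.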

Thanks in advance. I must be missing some elementary point in all this.

Output & Results

z: Array[(String, String)] = Array((hello,how), (how,are), (are,you), (you,today), (i,am), (am,fine), (fine,but), (but,would), (would,like), (like,to), (to,talk), (talk,to), (to,you), (you,about), (about,the), (the,cat), (he,is), (is,not), (not,doing), (doing,so), (so,well), (what,should), (should,we), (we,do), (please,help), (help,me), (hi,there), (there,ged))
map: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[669] at mapPartitionsWithIndex at :123

Partition assignment

res13: Array[String] = Array(hello -> 0, how -> 0, are -> 0, you -> 0, today. -> 0, i -> 0, am -> 32, fine -> 32, but -> 32, would -> 32, like -> 32, to -> 32, talk -> 60, to -> 60, you -> 60, about -> 60, the -> 60, cat. -> 60, he -> 60, is -> 60, not -> 96, doing -> 96, so -> 96, well. -> 96, what -> 96, should -> 122, we -> 122, do. -> 122, please -> 122, help -> 122, me. -> 122, hi -> 155, there -> 155, ged. -> 155)

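Maybe SPARK is really smart, smarter than I initially thought. Or maybe not? I have seen some material on partition preservation, and some of it is contradictory, imho.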

On map vs mapValues: does it mean that the former destroys the partitioning and therefore the processing within a single partition?

Answer 1

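You can use mapPartitions in place of any of the maps used to create wordsRDDTextSplit, but I don't really see any reason to. mapPartitions is most useful when there is a setup cost that you don't want to pay for every record in the RDD.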
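
As an illustration of that setup-cost point, here is a minimal sketch rather than code from the question. It assumes Spark is on the classpath and runs with a local master; the object name `MapPartitionsSetupSketch`, the helper `clean`, and the sample words are hypothetical, and the compiled regex merely stands in for a genuinely expensive resource.

```scala
import java.util.regex.Pattern
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSetupSketch {
  // Hypothetical helper: cleans tokens, doing the setup once per partition.
  def clean(sc: SparkContext, raw: Seq[String]): Array[String] = {
    val words = sc.parallelize(raw, 2)
    words.mapPartitions { iter =>
      // This line runs once per partition, not once per record.
      val nonWord = Pattern.compile("\\W+")
      iter.map(w => nonWord.matcher(w.toLowerCase).replaceAll(""))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitions-setup-sketch")
    val sc = new SparkContext(conf)
    try println(clean(sc, Seq("Hello,", "how", "are", "you", "today!")).mkString(" "))
    finally sc.stop()
  }
}
```

For cheap per-record work such as the replaceAll calls in the question, a plain map would likely perform about the same, which is consistent with the answer's remark.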

Whether you use map or mapPartitions to create wordsRDDTextSplit, your sliding window does not operate on anything until you have created the local data structure x.
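
If the window should run on the RDD itself rather than on a collected array, one option is the `sliding` method that the `RDDFunctions` import in the question already brings into scope. The following is a hedged sketch, not the asker's method: it assumes the spark-mllib artifact is available and a local master is acceptable, and the object name `RddSlidingSketch`, the helper `bigrams`, and the sample words are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.RDDFunctions._

object RddSlidingSketch {
  // Hypothetical helper: bigrams computed on the RDD, before any collect().
  def bigrams(sc: SparkContext, words: Seq[String], numPartitions: Int): Array[(String, String)] =
    sc.parallelize(words, numPartitions)
      .sliding(2)                          // RDD[Array[String]] of consecutive pairs
      .map { case Array(a, b) => (a, b) }
      .collect()                           // collected here only to print the result

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-sliding-sketch")
    val sc = new SparkContext(conf)
    try bigrams(sc, Seq("hello", "how", "are", "you", "today"), 3).foreach(println)
    finally sc.stop()
  }
}
```

Unlike a window applied separately inside each partition, this `sliding` is documented to order items by partition index and then by position within the partition, so pairs that cross a partition boundary should still be produced.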

SPARK N-gram & Parallelization not using mapPartitions
