Question

I want to compute the frequency of each bigram.

So I wrote:

val intputFile = "bible+shakes.nopunc"
val sentences = sc.textFile(intputFile)

val bigrams = sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList =>
  for (i <- List.range(0, (wordList.length - 2))) yield ((wordList(i), wordList(i + 1)), 1)
)

val bigrams2 = sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList =>
  wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)}
)

They appear to have the same type:

scala> bigrams
res11: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[7] at flatMap at <console>:28

scala> bigrams2
res12: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[11] at flatMap at <console>:28

However, when I collect them:

scala> bigrams.collect
res13: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1), ((version,textfile),1), ((in,the),1), ((the,beginning),1), ((beginning,god),1), ((god,created),1), ((created,the),1), ((the,heaven),1), ((heaven,and),1), ((and,the),1), ((and,the),1), ((the,earth),1), ((earth,was),1), ((was,without),1), ((without,form),1), ((form,and),1), ((and,void),1), ((void,and),1), ((and,darkness),1), ((darkness,was),1), ((was,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((the,deep),1), ((deep,and),1), ((and,the),1), ((the,spirit),1), ((spirit,of),1), ((of,god),1), ((god,moved),1), ((moved,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((and,god),1), ((god,said),1), ((...

scala> bigrams2.collect
16/10/05 10:17:52 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 20)
scala.MatchError: [Ljava.lang.String;@3224ea91 (of class [Ljava.lang.String;)
    at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$apply$1.apply(<console>:29)

scala> bigrams2.take(5)
res25: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1))

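Evaluating the second version fully (with collect) raises an error, although take(5) works.

Why is that, and how can I fix it? I prefer the second, more concise way.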

Answer 1

The problem is in your bigrams2 expression:

wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)}

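The map block does not handle the case where wordList has only one item: sliding(2) then yields a single window containing just that item, which does not match Array(x, y). That is why you get the MatchError. Your input apparently contains sentences made up of a single word.

To fix it, add a case for a one-item array: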
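To make the failure concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name SlidingDemo and the helper windows are made up for illustration; only split and sliding come from the question. It suggests that a one-word line, and also an empty line, both produce a single window of size 1:

```scala
object SlidingDemo {
  // Hypothetical helper: the windows that sliding(2) produces for one line,
  // converted to Lists so they can be compared and printed.
  def windows(line: String): List[List[String]] =
    line.trim.split(' ').sliding(2).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    println(windows("holy bible")) // one window with two words
    println(windows("amen"))       // one window with a single word
    println(windows(""))           // split gives Array(""), so again one window of size 1

    // The single-pattern match from the question fails on the short window.
    val attempt = scala.util.Try(
      "amen".trim.split(' ').sliding(2).map { case Array(x, y) => ((x, y), 1) }.toList
    )
    println(attempt.isFailure)
  }
}
```

This would also explain why take(5) succeeds while collect fails: the first few lines all have at least two words, so the short window is only reached when the whole data set is evaluated.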

wordList.sliding(2).map {
  case Array(x, y) => ((x,y), 1)
  case Array(x) => ???
}

Answer 2

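The problem with wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)} is that {case Array(x, y) => ((x,y), 1)} is a partial function, which only knows how to handle input that matches the pattern Array(x, y).

So your map cannot handle a window that contains only one element. You should change it to the following: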

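wordList.sliding(2).flatMap {
  // Explicit inner tuple: Some takes one argument, here the pair ((x, y), 1)
  case Array(x, y) => Some(((x, y), 1))
  // Any other window (for example a single word) contributes nothing
  case _ => None
}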

Here, flatMap flattens the Option values, which ensures the result contains only valid bigrams.
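An equivalent alternative, sketched here under the assumption that windows which are not pairs should simply be skipped, is collect, which applies a partial function only where it is defined. The object Bigrams and the helper bigramsOf are hypothetical names, and the counting step uses plain Scala collections so the example runs without Spark:

```scala
object Bigrams {
  // Hypothetical helper (not from the original post): bigrams of one line.
  // collect keeps only the windows matching Array(x, y) and drops the rest.
  def bigramsOf(line: String): List[((String, String), Int)] =
    line.trim.split(' ')
      .sliding(2)
      .collect { case Array(x, y) => ((x, y), 1) }
      .toList

  def main(args: Array[String]): Unit = {
    val lines = List("in the beginning", "in the end", "amen", "")
    // Local stand-in for the per-key sum that Spark would do across the cluster.
    val counts = lines
      .flatMap(bigramsOf)
      .groupBy(_._1)
      .map { case (bigram, ones) => (bigram, ones.map(_._2).sum) }
    println(counts)
  }
}
```

On the RDD from the question, something like sentences.flatMap(Bigrams.bigramsOf).reduceByKey(_ + _) should give the per-bigram frequencies. As a side note, the index-based version in the question uses List.range(0, wordList.length - 2), whose upper bound is exclusive, so it appears to drop the last bigram of every line; the sliding approach does not have that problem.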

Scala - Spark word count: why does sliding not work?