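Question:
I have an RDD of the form Array[Array[String]], and I need to generate combinations of the strings inside each inner array. But when I apply the map function, I get the following error: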
java.io.NotSerializableException: scala.collection.TraversableOnce$FlattenOps$$anon$1
Serialization stack:
- object not serializable (class: scala.collection.TraversableOnce$FlattenOps$$anon$1, value: non-empty iterator)
- element of array (index: 0)
- array (class [Lscala.collection.Iterator;, size 10)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Background
Initially I had the following:
Array[org.apache.spark.sql.Row] = Array([cyber crimes ;; cyber security ;; review ;; india ;; instances ;; state ;; issue], [civil rights ;; case ;; instances ;; frequency])
When I clean it up with the following code:
words.map(r => r(0).asInstanceOf[String].split("\\;;").map(_.trim))
the result is:
Array[Array[String]] = Array(Array(cyber crimes, cyber security, review, india, instances, state, issue), Array(civil society, instances, frequency))
Now I need all possible combinations of the strings within each array, like:
Array[Array[String]] = Array(Array((cyber crimes, cyber security), (review, india), (instances, state), (issue,cyber crimes))....etc)
When I apply map to this, it gives me the error above:
val combinations = cleanwords.map(r => r(0).asInstanceOf[String].combinations(2))
Can anyone help me get the desired result?
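Answer 0: (score: 1)
The error most likely occurs because you are trying to collect an RDD whose elements are iterators (which is what combinations returns), and those iterators are not serializable. In addition, you need to call combinations directly on the array rather than on r(0), which is a single String: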
cleanwords.map(_.combinations(2).toArray).collect
// res47: Array[Array[Array[String]]] = Array(Array(Array(cyber crimes, cyber security), Array(cyber crimes, review), Array(cyber crimes, india) ..
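The difference between the iterator and the materialized array can be checked without Spark. The following is a minimal sketch, assuming plain Java serialization behaves like the JavaSerializationStream.writeObject call in the stack trace; the object name CombinationsSerializationSketch and the helper javaSerializable are hypothetical and only serve the illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CombinationsSerializationSketch {
  // Hypothetical helper: true if plain Java serialization accepts the value,
  // false if it throws NotSerializableException.
  def javaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Sample data taken from the first row in the question.
  val words: Array[String] = Array("cyber crimes", "cyber security", "review")

  // combinations(2) is lazy: it returns an Iterator, not a collection.
  def lazyPairs: Iterator[Array[String]] = words.combinations(2)

  // toArray materializes the iterator into plain nested arrays.
  val pairs: Array[Array[String]] = words.combinations(2).toArray

  def main(args: Array[String]): Unit = {
    println(s"iterator serializable: ${javaSerializable(lazyPairs)}")
    println(s"array serializable: ${javaSerializable(pairs)}")
    println(pairs.map(_.mkString("(", ", ", ")")).mkString(" "))
  }
}
```

In the Spark job the same thing happens per element: mapping to combinations(2) leaves iterators in the RDD, and collecting them forces the executor to serialize those iterators, whereas appending toArray leaves only arrays of strings.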
To get tuples back instead:
cleanwords.map(_.combinations(2).map(x => (x(0), x(1))).toArray).collect
// res60: Array[Array[(String, String)]] = Array(Array((cyber crimes,cyber security), (cyber crimes,review), (cyber crimes,india) ..
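Why combinations has to be called on the array itself can also be seen on a single row, again without Spark. This is a small sketch with a hypothetical object name and a sample row borrowed from the question's data: calling combinations on r(0) operates on the characters of the first word, while calling it on the whole array pairs up the words.

```scala
object CombinationsOnArraySketch {
  // One cleaned row, as it would appear inside cleanwords.
  val row: Array[String] = Array("case", "instances", "frequency")

  // row(0) is the String "case", so combinations(2) here yields
  // two-character strings built from its letters, not pairs of words.
  val charPairs: List[String] = row(0).combinations(2).toList

  // Calling combinations on the array pairs up the words; mapping each
  // two-element array to a tuple mirrors the tuple variant shown above.
  val wordPairs: List[(String, String)] =
    row.combinations(2).map(x => (x(0), x(1))).toList

  def main(args: Array[String]): Unit = {
    println(charPairs)
    println(wordPairs)
  }
}
```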