I'm new to Scala, Spark, and MLlib, and I'm currently struggling with an error whose cause I can't figure out.
I have an RDD with multiple partitions containing data like this (output from take(#)):
Array[TermDoc] = Array(TermDoc(142389495503925248,Set(NEU),ArrayBuffer(salg, veotv, día, largooooo)), TermDoc(142389933619945473,Set(NEU),ArrayBuffer(librar, ayudar, bes, graci)), TermDoc(142391947707940864,Set(P),ArrayBuffer(graci, mar)), TermDoc(142416095012339712,Set(N+),ArrayBuffer(off, pensand, regalit, sind, va, sgae, van, corrupt, intent, sacar, conclusion, intent)), TermDoc(142422495721562112,Set(P+),ArrayBuffer(conozc, alguien, q, adict, dram, ja, ja, ja, suen, d)), TermDoc(142424715175280640,Set(NEU),ArrayBuffer(rt, si, amas, alguien, dejal, libr, si, grit, hombr, paurubi)), TermDoc(142483342040907776,Set(P+),ArrayBuffer(toca, grabacion, dl, especial, navideñ, mari, crism)), TermDoc(142493511634259968,Set(NEU))
Since there is output, I assumed the RDD was not empty, but when I try to execute:
val count = rdd.count()
java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:475)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.init(ArrayOps.scala:108)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$.buildDocument(<console>:58)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1598)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/03/13 10:15:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:475)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.init(ArrayOps.scala:108)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$.buildDocument(<console>:58)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1598)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/03/13 10:15:11 ERROR scheduler.TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
17/03/13 10:15:11 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3, localhost): TaskKilled (killed intentionally)
17/03/13 10:15:11 WARN spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
17/03/13 10:15:11 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:475)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.init(ArrayOps.scala:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$.buildDocument(<console>:58)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1598)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1940)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:73)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:75)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:77)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:79)
at $iwC$$iwC$$iwC.<init>(<console>:81)
at $iwC$$iwC.<init>(<console>:83)
at $iwC.<init>(<console>:85)
at <init>(<console>:87)
at .<init>(<console>:91)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1064)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:475)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.init(ArrayOps.scala:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$.buildDocument(<console>:58)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TweetParser$$anonfun$2.apply(<console>:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1598)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Apparently it's saying I'm trying to call it on an empty RDD. What is going on? It also fails on this line:
val terms = termDocsRdd.flatMap(_.terms).distinct().sortBy(identity)
with the same empty.init exception.
Thanks.
UPDATE: adding the requested information
object TweetParser extends Serializable {
  val headerPart = "polarity"
  val mentionRegex = """@(.)+?\s""".r
  val fullRegex = """(\d+),(.+?),(N|P|NEU|NONE)(,\w+|;\w+)*""".r

  def parseAll(csvFiles: Iterable[String], sc: SparkContext): RDD[Document] = {
    val csv = sc.textFile(csvFiles mkString ",")
    //val docs = scala.collection.mutable.ArrayBuffer.empty[Document]
    val docs = csv.filter(!_.contains(headerPart)).map(buildDocument(_))
    docs
    //docs.filter(!_.docId.equals("INVALID"))
  }

  def buildDocument(line: String): Document = {
    val lineSplit = line.split(",")
    val id = lineSplit.head
    val txt = lineSplit.tail.init.init.mkString(",")
    val sent = lineSplit.init.last
    val opt = lineSplit.last
    if (id != null && txt != null && sent != null) {
      if (txt.equals("")) {
        //the line does not contain the option after sentiment
        new Document(id, mentionRegex.replaceAllIn(sent, ""), Set(opt))
      } else {
        new Document(id, mentionRegex.replaceAllIn(txt, ""), Set(sent))
      }
    } else {
      println("Invalid")
      new Document("INVALID")
    }
  }
}
case class Document(docId: String, body: String = "", labels: Set[String] = Set.empty)
The Tokenizer object:
import java.io.StringReader
import org.apache.lucene.analysis.es.SpanishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import org.apache.spark.rdd.RDD
object Tokenizer extends Serializable {
  //val LuceneVersion = Version.LUCENE_5_1_0

  def tokenizeAll(docs: RDD[Document]) = docs.map(tokenize)

  def tokenize(doc: Document): TermDoc = TermDoc(doc.docId, doc.labels, tokenize(doc.body))

  def tokenize(content: String): Seq[String] = {
    val result = scala.collection.mutable.ArrayBuffer.empty[String]
    /*content.split("\n").foreach(line => line.split(" ").foreach(
      word => if (word.startsWith("#")) result += word.substring(1) else word
    ))*/
    val analyzer = new SpanishAnalyzer()
    analyzer.setVersion(Version.LUCENE_5_1_0)
    val tReader = new StringReader(content)
    val tStream = analyzer.tokenStream("", tReader)
    val term = tStream.addAttribute(classOf[CharTermAttribute])
    tStream.reset()
    while (tStream.incrementToken()) {
      val termValue = term.toString
      if (termValue.startsWith("#")) {
        result += termValue.substring(1)
      } else {
        result += termValue
      }
    }
    result
  }
}
case class TermDoc(doc: String, labels: Set[String], terms: Seq[String])
The driver code:
val csvFiles = List("/path/to/file.csv", "/path/to/file2.csv", "/path/to/file3.csv")
val docs = TweetParser.parseAll(csvFiles, sc)
val termDocsRdd = Tokenizer.tokenizeAll(docs)
val numDocs = termDocsRdd.count()
val terms = termDocsRdd.flatMap(_.terms).distinct().sortBy(identity)
I tested this in the spark-shell, which is why the driver code looks like that. I hope this clarifies the question.
Answer 0 (score: 2)
"Apparently it's saying I'm trying to call count on an empty RDD"
Actually, no: that is not what the error says. count triggers the computation of this RDD, and the exception is thrown while one of the RDD's records is being computed.
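This also explains why take(#) produced output while count() failed: take evaluates only as many records as it needs, whereas count forces every record through your parsing. A minimal sketch of the same effect, using hypothetical data (not your files):

val sketch = sc.parallelize(Seq("1,hola,NEU", "oops")) // second record has too few fields
  .map(_.split(",").tail.init.init)                    // same access pattern as buildDocument
sketch.take(1)  // succeeds: the malformed record is never evaluated
sketch.count()  // throws java.lang.UnsupportedOperationException: empty.init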
Specifically, the error says:
java.lang.UnsupportedOperationException: empty.init
which is probably thrown by one of these expressions in buildDocument:
val txt = lineSplit.tail.init.init.mkString(",")
val sent = lineSplit.init.last
This snippet assumes that lineSplit is a collection with at least 3 elements, and the exception you're seeing is the result of that assumption being wrong for at least one record: if lineSplit has only 2 elements, for example, lineSplit.tail.init is an empty collection, so lineSplit.tail.init.init throws the exception you're seeing.
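You can reproduce the failure in isolation in the REPL (with a hypothetical two-field line):

scala> Array("id", "text").tail.init        // tail has one element, its init is empty: fine
res0: Array[String] = Array()

scala> Array("id", "text").tail.init.init   // init on an empty collection
java.lang.UnsupportedOperationException: empty.init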
To overcome this, you can rewrite your parsing to handle such irregularities in the data properly. Two options:
Wrap the parsing in Try(...) and keep only the successful records, e.g.:
import scala.util.{Try, Success}

def parseAll(csvFiles: Iterable[String], sc: SparkContext): RDD[Document] = {
  val csv = sc.textFile(csvFiles mkString ",")
  val docs = csv.filter(!_.contains(headerPart))
    .map(s => Try(buildDocument(s)))
    .collect { case Success(v) => v }
  docs
}
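The same Try also gives you a cheap way to inspect which raw input lines are malformed before deciding how to handle them. A sketch, assuming csv is the RDD[String] produced by sc.textFile above:

import scala.util.Try
// print a few raw lines that buildDocument cannot parse
csv.filter(line => Try(buildDocument(line)).isFailure).take(10).foreach(println)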
Alternatively, change the parsing so that the "missing" parts of lineSplit are set to null (as shown below), e.g.:
def buildDocument(line: String): Document = {
  val (id, txt, sent, opt) = line.split(",").padTo(5, null) match {
    case Array(a, b, c, d, e, _*) => (a, s"$b,$c", d, e)
  }
  // continue as before....
}
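For reference, padTo is what makes the match total: it pads the split array with null up to length 5, so Array(a, b, c, d, e, _*) always matches, and the existing null checks in buildDocument can then route incomplete records to the INVALID branch. A quick REPL sketch with a hypothetical short line:

scala> "142,hola".split(",").padTo(5, null)
res1: Array[String] = Array(142, hola, null, null, null)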