I am trying to deploy a Spark 2.0 (streaming) application on EMR 5 that connects to Cassandra. The Spark-Cassandra connector I am using is: "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3".
The application runs standalone on my machine and connects to Cassandra successfully (it saves data). All the relevant Cassandra ports appear to be open in the cluster, yet I still get the exception below.
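For reference, the dependency is declared in build.sbt roughly like this (a minimal sketch; the Scala/Spark versions and the "provided" scoping are my assumptions, only the connector coordinates come straight from my build):

    // build.sbt sketch; versions and scoping assumed, connector coordinates exact
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-streaming"                % "2.0.0" % "provided",
      "com.datastax.spark"  % "spark-cassandra-connector_2.11" % "2.0.0-M3"
    )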
Here is the getCassandraMappedTable function:
class VisitDaoImpl {
  override def getCassandraMappedTable(): CassandraTableScanRDD[Visit] = {
    SparkContextHolder.sparkContext.cassandraTable[Visit](keyspace, tableName)
  }
}
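SparkContextHolder is not shown above; assume it is something like this minimal sketch, i.e. a global field holding the SparkContext created at application startup:

    import org.apache.spark.SparkContext

    // Minimal sketch of the (unshown) SparkContextHolder: a global field
    // exposing the SparkContext created when the application starts.
    object SparkContextHolder {
      var sparkContext: SparkContext = _
    }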
And the relevant Visit class:
case class Visit(visitorKey: String, normalizedDomain: String, timestamp: Date, visitId: String, batchId: Long) extends Serializable

object Visit extends CassandraTable {
  import Visit.Columns._

  implicit object Mapper extends DefaultColumnMapper[Visit](
    Map("visitorKey"       -> VISITOR_KEY,
        "normalizedDomain" -> NORMALIZED_DOMAIN,
        "timestamp"        -> TIMESTAMP,
        "visitId"          -> VISIT_ID))

  val TABLE_NAME = "visit"

  case object Columns {
    val VISITOR_KEY       = "visitor_key"
    val NORMALIZED_DOMAIN = "normalized_domain"
    val TIMESTAMP         = "timestamp"
    val VISIT_ID          = "visit_id"
  }

  val columnsNames: Seq[ColumnName] = toColumnNames(Columns)
}
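For completeness, this is how the mapping is meant to be exercised on the driver (a sketch; "my_keyspace" is a placeholder, and the implicit Mapper in Visit's companion object is picked up automatically by the connector):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.rdd.CassandraTableScanRDD

    // Driver-side sketch: read the table as an RDD of Visit using the
    // implicit Mapper above. The keyspace name here is a placeholder.
    val visits: CassandraTableScanRDD[Visit] =
      sc.cassandraTable[Visit]("my_keyspace", Visit.TABLE_NAME)
    visits.take(10).foreach(println)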
I can see no good reason to be getting the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1069.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1069.0 (TID 721, ip-10-0-0-111.eu-west-1.compute.internal): java.lang.NullPointerException
at com.datastax.spark.connector.SparkContextFunctions.cassandraTable$default$3(SparkContextFunctions.scala:52)
at com.naturalint.myproject.daoimpl.VisitDaoImpl.getCassandraMappedTable(VisitDaoImpl.scala:24)
at com.naturalint.myproject.daoimpl.VisitDaoImpl.findLatestBetween(VisitDaoImpl.scala:92)
at com.naturalint.myproject.servicesimpl.MyAlgo$$anonfun$processStream$1$$anonfun$apply$2.apply(MyAlgo.scala:122)
at com.naturalint.myproject.servicesimpl.MyAlgo$$anonfun$processStream$1$$anonfun$apply$2.apply(MyAlgo.scala:110)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
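The code at MyAlgo.scala:110-122 is not shown; judging purely from the trace (foreachRDD → Iterator.foreach → findLatestBetween → getCassandraMappedTable), the failing call site has roughly the shape of this reconstruction. Only the method names come from the trace; the findLatestBetween signature and the closure body are guesses:

    import java.util.Date
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical reconstruction of the call path in the stack trace;
    // the findLatestBetween signature and closure body are assumptions.
    class MyAlgo(visitDao: VisitDaoImpl) extends Serializable {
      def processStream(stream: DStream[Visit]): Unit = {
        stream.foreachRDD { rdd =>
          rdd.foreach { visit =>
            // Per the trace, findLatestBetween (VisitDaoImpl.scala:92)
            // calls getCassandraMappedTable from inside this closure.
            visitDao.findLatestBetween(visit.timestamp, new Date())
          }
        }
      }
    }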
Any ideas?

Thanks,
Eran