I am trying to add elements to TitanDB from Spark Streaming (consuming messages from a Kafka queue), but it is turning out to be harder than expected. Here is the definition of the Titan connection:
val confPath: String = "titan-cassandra-es-spark.properties"
val conn: TitanModule = new TitanModule(confPath)
TitanModule is a Serializable class that configures the TitanDB connection:
...
val configurationFilePath: String = confFilePath
val configuration = new PropertiesConfiguration(configurationFilePath)
val gConn: TitanGraph = TitanFactory.open(configuration)
...
When I run the Spark Streaming job that consumes (JSON) messages from the Kafka queue, it receives the messages, tries to add them to TitanDB, and blows up with the stack trace below.
Does anyone know whether adding data to TitanDB from Spark Streaming is actually feasible? Any idea what the solution might be?
18:03:50,596 ERROR JobScheduler:95 - Error running job streaming job 1464624230000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:911)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
at salvob.SparkConsumer$$anonfun$main$1.apply(SparkConsumer.scala:200)
at salvob.SparkConsumer$$anonfun$main$1.apply(SparkConsumer.scala:132)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.commons.configuration.PropertiesConfiguration
Serialization stack:
- object not serializable (class: org.apache.commons.configuration.PropertiesConfiguration, value: org.apache.commons.configuration.PropertiesConfiguration@2cef9ce8)
- field (class: salvob.TitanModule, name: configuration, type: class org.apache.commons.configuration.PropertiesConfiguration)
- object (class salvob.TitanModule, salvob.TitanModule@20d984db)
- field (class: salvob.SparkConsumer$$anonfun$main$1$$anonfun$apply$3, name: conn$1, type: class salvob.TitanModule)
- object (class salvob.SparkConsumer$$anonfun$main$1$$anonfun$apply$3, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 28 more
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
... (same stack trace and serialization stack as above)
Answer 0 (score: 1)
Spark Streaming produces RDDs, and the processing of the data inside an RDD happens on the worker nodes. The code you write inside rdd.map() is serialized together with the objects referenced in that block and shipped to the worker nodes for processing.
So the ideal way to use a graph instance with Spark is the following:
streamRdd.map(kafkaTuple => {
    // create graph instance
    // use graph instance to add / modify graph
    // close graph instance
})
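To make that pseudocode concrete, a rough per-record version might look like the one below. It is only a sketch: it assumes Titan 1.0 / TinkerPop 3 style APIs, a hypothetical "message" vertex label with a "body" property, that kafkaTuple is a (key, value) pair from the Kafka stream, and it uses foreach (an action, so the writes actually execute), which is also what the asker's stack trace shows:

import com.thinkaurelius.titan.core.TitanFactory
import org.apache.tinkerpop.gremlin.structure.T

streamRdd.foreach { kafkaTuple =>
  // Opened on the executor for every single record; correct but expensive.
  val graph = TitanFactory.open("titan-cassandra-es-spark.properties")
  graph.addVertex(T.label, "message", "body", kafkaTuple._2) // hypothetical schema
  graph.tx().commit()
  graph.close()
}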
However, this creates a new graph instance for every record. As an optimization, you can create one graph instance per partition:
rdd.foreachPartition((rddRows: Iterator[kafkaTuple]) => {
    val graph: TitanGraph = // create titan instance
    val trans: TitanTransaction = graph.newTransaction()
    rddRows.foreach(graphVertex => {
        // do graph insertion in the above transaction
    })
    trans.commit()
    graph.close()
})
graph.newTransaction() helps with multi-threaded graph updates here; otherwise you will get lock exceptions.
The only catch is that, from what I have read so far, there is no direct support for multi-node updates. From what I have seen, a Titan transaction takes a lock in HBase when it tries to modify a vertex, so other partitions fail whenever they attempt an update. You either have to build an external synchronization mechanism, or repartition your RDD into a single partition and then use the code above to do the updates.
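Putting the pieces above together for the original question, a minimal sketch could look like the following. It is an illustration rather than the asker's actual code: it assumes Titan 1.0 / TinkerPop 3 style APIs, a hypothetical "message" vertex label with a "body" property key, and that the properties file is readable on every executor. Only the configuration path (a plain String) is captured by the closure, so nothing non-serializable is shipped to the workers:

import com.thinkaurelius.titan.core.{TitanFactory, TitanGraph}
import org.apache.tinkerpop.gremlin.structure.T

// Path to the Titan properties file; the file itself must be present on every executor.
val confPath = "titan-cassandra-es-spark.properties"

// `messages` stands in for the DStream of JSON strings coming out of Kafka.
messages.foreachRDD { rdd =>
  // repartition(1) funnels all writes through one partition, avoiding cross-partition lock conflicts.
  rdd.repartition(1).foreachPartition { records =>
    val graph: TitanGraph = TitanFactory.open(confPath) // opened on the executor, never serialized
    val tx = graph.newTransaction()
    records.foreach { json =>
      tx.addVertex(T.label, "message", "body", json)    // hypothetical label and property key
    }
    tx.commit()
    graph.close()
  }
}

In a real job you would also roll back the transaction (tx.rollback()) and close the graph in a try/finally when an insertion fails, rather than leaving the transaction open.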
Answer 1 (score: 0)
Make sure all the classes that can be shipped to the slave machines are Serializable; this is very important. Do not initialize any variables outside of those classes.
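As a sketch of that rule applied to the asker's TitanModule (an illustration, not the actual class): keep only the String path as a regular field and mark everything derived from it, the PropertiesConfiguration and the TitanGraph, as @transient lazy, so Spark serializes just the path and each executor rebuilds the rest locally on first use:

import org.apache.commons.configuration.PropertiesConfiguration
import com.thinkaurelius.titan.core.{TitanFactory, TitanGraph}

class TitanModule(confFilePath: String) extends Serializable {
  // Only this String travels with the closure; the properties file must also exist on every executor.
  val configurationFilePath: String = confFilePath

  // @transient: skipped during serialization; lazy: rebuilt on first use inside each executor JVM.
  @transient lazy val configuration = new PropertiesConfiguration(configurationFilePath)
  @transient lazy val gConn: TitanGraph = TitanFactory.open(configuration)
}

The trade-off is that every deserialized copy opens its own graph and nothing ever closes it, so the per-partition open/commit/close pattern from the previous answer is usually the cleaner choice.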
I have used Apache Spark (not Streaming) with Titan and it worked well. It was not entirely straightforward, because Titan itself depends on a particular Spark version, so there are dependency conflicts. This is the only version that worked for me:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.2</version>
</dependency>

This is how I started the cluster:

SparkConf conf = new SparkConf()
        .setAppName(AbstractSparkImporter.class.getCanonicalName())
        .setMaster("spark_cluster_name");
this.sc = new JavaSparkContext(conf);
this.numPartitions = new Integer(num);

Then I parse the data.
If necessary, I can publish this module on Github.