Inserting data into TitanDB using Spark (or Spark Streaming)

Asked: 2016-05-30 16:41:28

Tags: spark-streaming titan serializable

I am trying to add elements to TitanDB using Spark Streaming (collecting messages from a Kafka queue), but it seems harder than expected. Here is the definition of the Titan connection:

val confPath: String = "titan-cassandra-es-spark.properties"
val conn: TitanModule = new TitanModule(confPath) 

TitanModule is a Serializable class that configures the TitanDB connection:

...
val configurationFilePath: String = confFilePath
val configuration = new PropertiesConfiguration(configurationFilePath)
val gConn: TitanGraph = TitanFactory.open(configuration)
...

When I run the Spark Streaming job that collects messages (JSON) from the Kafka queue, it receives the messages, tries to add them to TitanDB, and blows up with the stack trace below.

Do you know whether adding data to TitanDB with Spark Streaming is feasible at all? And do you have any idea what the solution might be?

18:03:50,596 ERROR JobScheduler:95 - Error running job streaming job 1464624230000 ms.0
org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:911)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
        at salvob.SparkConsumer$$anonfun$main$1.apply(SparkConsumer.scala:200)
        at salvob.SparkConsumer$$anonfun$main$1.apply(SparkConsumer.scala:132)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.commons.configuration.PropertiesConfiguration
Serialization stack:
        - object not serializable (class: org.apache.commons.configuration.PropertiesConfiguration, value: org.apache.commons.configuration.PropertiesConfiguration@2cef9ce8)
        - field (class: salvob.TitanModule, name: configuration, type: class org.apache.commons.configuration.PropertiesConfiguration)
        - object (class salvob.TitanModule, salvob.TitanModule@20d984db)
        - field (class: salvob.SparkConsumer$$anonfun$main$1$$anonfun$apply$3, name: conn$1, type: class salvob.TitanModule)
        - object (class salvob.SparkConsumer$$anonfun$main$1$$anonfun$apply$3, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
        ... 28 more
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
        ... (stack trace identical to the one above)

2 Answers:

Answer 0 (score: 1)

Spark Streaming produces RDDs, and the data inside an RDD is processed on the worker nodes. The code you write inside rdd.map() is serialized along with the objects referenced in that block and shipped to the worker nodes for processing.

The ideal way to use a graph instance with Spark is therefore the following:

streamRdd.map(kafkaTuple => {
  // create graph instance
  // use graph instance to add / modify graph
  // close graph instance
})

But this creates a new graph instance for every single row. As an optimization, you can create one graph instance per partition:

rdd.foreachPartition((rddRows: Iterator[kafkaTuple]) => {
      val graph: TitanGraph = // create titan instance
      val trans: TitanTransaction = graph.newTransaction()

      rddRows.foreach(graphVertex => {
        // do the graph insertion inside the above transaction
      })

      trans.commit()
      graph.close()
})

graph.newTransaction() helps here with multi-threaded graph updates; otherwise you will get lock exceptions.

The only problem is that, from what I have read so far, there is no direct support for multi-node updates. From what I have seen, a Titan transaction takes a lock in HBase whenever it tries to modify a vertex, so the other partitions fail as soon as they attempt any update. You either have to build an external synchronization mechanism, or repartition your RDD into a single partition and then use the code above to do the updates.
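
For example, a minimal sketch of the single-partition variant (the names messages, confPath and writeSingleThreaded are placeholders: messages is assumed to be the DStream of JSON strings coming from Kafka, and confPath the path to the Titan properties file):

import com.thinkaurelius.titan.core.{TitanFactory, TitanGraph}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: all writes are funnelled through a single partition so that
// concurrent Titan transactions cannot conflict on the same locks.
def writeSingleThreaded(messages: DStream[String], confPath: String): Unit = {
  messages.foreachRDD { rdd =>
    rdd.repartition(1).foreachPartition { rows =>
      val graph: TitanGraph = TitanFactory.open(confPath) // opened on the executor, never serialized
      val trans = graph.newTransaction()
      rows.foreach { json =>
        // parse json and add the corresponding vertices / edges through trans
      }
      trans.commit()
      graph.close()
    }
  }
}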

Answer 1 (score: 0)

Make sure every class that can be shipped to the slave machines is Serializable. This is quite important. Do not initialize any variables outside of those classes.
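
For example, a minimal sketch of such a class (the TitanModule name comes from the question above; the rest is an assumption): only the configuration path gets serialized, while the non-serializable graph is created lazily on whichever JVM actually touches it.

import com.thinkaurelius.titan.core.{TitanFactory, TitanGraph}

// Sketch only: the String path is serialized with the closure; the TitanGraph
// itself is excluded from serialization (@transient) and rebuilt lazily on use.
class TitanModule(confFilePath: String) extends Serializable {
  @transient lazy val gConn: TitanGraph = TitanFactory.open(confFilePath)
}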

I have used Apache Spark (not Streaming) and it worked well. It was not easy to get right, since Titan uses its own version of Spark, so there will be some dependency conflicts. This is the only version that worked:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.2</version>
</dependency>

This is how I started the cluster:

SparkConf conf = new SparkConf()
                .setAppName(AbstractSparkImporter.class.getCanonicalName())
                .setMaster("spark_cluster_name");
this.sc = new JavaSparkContext(conf);
this.numPartitions = new Integer(num);

Then I parse the data.

I can publish this module on Github if necessary.