Writing a SimpleFeature to Cassandra via a Spark RDD

Asked: 2018-02-12 13:50:05

Tags: apache-spark geomesa

I'm wondering whether it's possible to write SimpleFeatures to Cassandra from within a Spark context. I'm trying to map my data to SimpleFeatures in a Spark RDD, but I'm running into problems. The createFeature() function being called below works fine in a standalone unit test; I have another unit test that calls it and, through the GeoMesa API, successfully writes the SimpleFeature it produces to Cassandra:

import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator

. . .

private val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
  .setAppName(appName)
  .setMaster(master)

. . .                                            

val rowsRDD = processedRDD.map(r => {

  ...

  println("** NAME VALUE MAP **")
  for ((k, v) <- featureNamesValues) printf("key: %s, value: %s\n", k, v)

  val feature = MyGeoMesaManager.createFeature(featureTypeConfig.asJava, featureNamesValues.asJava)
  feature
})

rowsRDD.print()
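
For context, createFeature() is my own helper. A minimal sketch of one way such a helper could be written with GeoTools' SimpleFeatureBuilder (the object name and the spec-string signature here are illustrative only; my actual version takes a feature type config map):

import scala.collection.JavaConverters._
import org.geotools.feature.simple.SimpleFeatureBuilder
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes
import org.opengis.feature.simple.SimpleFeature

object CreateFeatureSketch {
  // spec string, e.g. "name:String,dtg:Date,*geom:Point:srid=4326"
  def createFeature(spec: String, values: java.util.Map[String, AnyRef]): SimpleFeature = {
    val sft = SimpleFeatureTypes.createType("myfeature", spec)
    val builder = new SimpleFeatureBuilder(sft)
    values.asScala.foreach { case (k, v) => builder.set(k, v) }
    builder.buildFeature(null) // null -> auto-generated feature id
  }
}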

However, making that same call inside the RDD's map() function in the Spark context now fails: because Spark has to serialize the partitioned results, I get a serialization error on SimpleFeatureImpl:

18/02/12 08:00:46 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 9)
java.io.NotSerializableException: org.geotools.feature.simple.SimpleFeatureImpl
Serialization stack:
- object not serializable (class: org.geotools.feature.simple.SimpleFeatureImpl, value: SimpleFeatureImpl:myfeature=[SimpleFeatureImpl.Attribute: . . ., SimpleFeatureImpl.Attribute: . . .])
- element of array (index: 0)
- array (class [Lorg.opengis.feature.simple.SimpleFeature;, size 4)
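
As an aside, one workaround that sidesteps serialization entirely is to keep only plain, serializable values in the RDD and build the SimpleFeatures inside each partition at write time, so the feature objects never cross the wire. A rough sketch, where extractNamesValues is a placeholder for the row-to-attributes logic elided above, and featureTypeConfig must itself be serializable since the closure captures it:

// ship only plain, serializable attribute maps across the cluster
val attributeRDD = processedRDD.map(r => extractNamesValues(r))

attributeRDD.foreachPartition { rows =>
  rows.foreach { featureNamesValues =>
    // constructed on the executor, so SimpleFeatureImpl is never serialized
    val feature = MyGeoMesaManager.createFeature(featureTypeConfig.asJava, featureNamesValues.asJava)
    // write `feature` to Cassandra via the GeoMesa data store here
  }
}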

Fair enough. I then added the Kryo dependency and registrator mentioned on the geomesa-spark-core page to mitigate this, but now I get a NoClassDefFoundError on the GeoMesaSparkKryoRegistrator class when the map function executes, even though, as you can see, the geomesa-spark-core dependency is on the classpath and I can import the class:

18/02/12 08:08:37 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 11)
java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:36)
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:32)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Finally, I tried adding the com.esotericsoftware.kryo dependency to the classpath as well, but I hit the same error.
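
From what I understand, "Could not initialize class" means the class was found but its static initializer failed, often because a transitive dependency is missing on the executors, rather than the class itself being absent. A quick sketch to surface the underlying cause, assuming sc is the SparkContext:

// force the registrator's static initializer to run inside a task so the
// real error shows up in the executor log instead of the terse message above
sc.parallelize(Seq(1)).foreach { _ =>
  try {
    Class.forName("org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$")
  } catch {
    case e: Throwable => e.printStackTrace()
  }
}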

Is it even possible to do what I'm attempting with GeoMesa, Spark, and Cassandra? It feels like I'm on the 1-yard line but can't punch it in.

1 Answer:

Answer 0 (Score: 1)

The easiest way to set up the classpath is to use Maven with the maven-shade-plugin. Add dependencies on the geomesa-cassandra-datastore and geomesa-spark-geotools modules:

<dependency>
  <groupId>org.locationtech.geomesa</groupId>
  <artifactId>geomesa-cassandra-datastore_2.11</artifactId>
</dependency>
<dependency>
  <groupId>org.locationtech.geomesa</groupId>
  <artifactId>geomesa-spark-geotools_2.11</artifactId>
</dependency>

Then add a maven-shade-plugin configuration, similar to the one used here for Accumulo. Submit your Spark job using the shaded jar, and the classpath should contain everything required.
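
A minimal shade-plugin sketch (execution details and filters are illustrative; the Accumulo example linked above is more complete):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- merge META-INF/services entries so GeoTools/GeoMesa SPI lookups keep working -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
        <filters>
          <filter>
            <!-- strip signature files that would otherwise invalidate the shaded jar -->
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>

Then submit with something like spark-submit --class com.example.MyApp target/myapp-shaded.jar (class and jar names are placeholders). The ServicesResourceTransformer matters for GeoTools-based code, since data store lookup goes through META-INF/services.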