Unable to write to Cassandra using pyspark

Date: 2019-06-17 14:17:36

Tags: pyspark amazon-emr cassandra-3.0

I am trying to write a dataframe to Cassandra using pyspark, but it gives me the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o74.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 3.0 failed 4 times, most recent failure: Lost task 6.3 in stage 3.0 (TID 24, ip-172-31-11-193.us-west-2.compute.internal, executor 1): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
        at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
        at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:153)
        at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
        at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:209)
        at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
        at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Here is my write code:

DataFrame.write.format(
   "org.apache.spark.sql.cassandra"
).mode(
   'append'
).options(
   table="student1", 
   keyspace="university"
).save()

I have added the spark-cassandra connector mentioned below to spark-defaults.conf:

spark.jars.packages datastax:spark-cassandra-connector:2.4.0-s_2.11

I am able to read data from Cassandra; the problem is only with writing.
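
For reference, a read along the same lines would look roughly like this (a hypothetical sketch, not code from the question; the table and keyspace names are taken from the write snippet above, and spark is assumed to be an existing SparkSession):

# Hypothetical read mirroring the failing write; assumes an existing SparkSession named `spark`
df = spark.read.format(
   "org.apache.spark.sql.cassandra"
).options(
   table="student1",
   keyspace="university"
).load()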

1 answer:

Answer 0: (score: 0)

I am not a Spark expert, but this might help:


These errors are commonly thrown when the Spark Cassandra Connector or its dependencies are not on the runtime classpath of the Spark application. This is usually caused by not using the prescribed --packages method of adding the Spark Cassandra Connector and its dependencies to the runtime classpath.

Source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#why-cant-the-spark-job-find-spark-cassandra-connector-classes-classnotfound-exceptions-for-scc-classes
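
For illustration, the --packages approach the FAQ refers to looks roughly like the following at submit time (a hedged sketch: the script name is a placeholder, and the package coordinate matches the one configured in spark-defaults.conf above):

# Hypothetical submit command; write_to_cassandra.py is a placeholder for the actual job script.
# --packages asks Spark to resolve the connector and its transitive dependencies at runtime
# and put them on the driver and executor classpaths.
spark-submit \
  --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 \
  write_to_cassandra.py

The missing class in the stack trace (com/twitter/jsr166e/LongAdder) comes from one of the connector's dependencies, which is why having only a partial set of jars on the executors can make reads appear to work while writes fail.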