Writing a DataFrame to Cassandra via Spark: java.lang.NullPointerException: Parameter value cannot be null

Date: 2020-01-27 00:27:50

Tags: json dataframe apache-spark pyspark cassandra

I'm trying to write a simple DataFrame from my PySpark job to a Cassandra instance that I have running in a Docker container.

In my setup, I read the contents of multiple JSON files into a DataFrame:

df = spark.read.json(path)
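
For completeness: this write path goes through the Spark Cassandra Connector (the "org.apache.spark.sql.cassandra" format used in the save call further down), so the session needs the connector package on the classpath and a connection host configured. A minimal sketch of that setup; the connector version and host here are placeholders, not my exact values:

from pyspark.sql import SparkSession

# Placeholder connector coordinates and host: adjust these to your
# connector version and to wherever the Dockerized Cassandra listens.
spark = SparkSession.builder \
    .appName("json-to-cassandra") \
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.11:2.4.3") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()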

...and then I perform some very basic operations on the frame, such as:

df = df.withColumn("friends", split(col("friends"), ", ").cast(ArrayType(StringType())))

My DataFrame schema then looks like this:

root
 |-- average_stars: double (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- fans: long (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: timestamp (nullable = true)

And I created the table like this:

CREATE TABLE user (
    user_id varchar PRIMARY KEY,
    name varchar,
    review_count int,
    yelping_since date,
    friends set<text>,
    useful int,
    funny int,
    cool int,
    fans int,
    elite set<int>,
    average_stars float
);
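
Note that the DataFrame types do not match the CQL types exactly: yelping_since is a timestamp in the frame but a date in the table, the long columns go into int, average_stars is a double going into a float, and the arrays go into sets. As far as I know the connector converts compatible types on its own, but for reference, a sketch of aligning the types explicitly before the write (column names taken from the schema above):

from pyspark.sql.functions import col, to_date

# Cast DataFrame columns so they line up with the CQL column types.
df = df \
    .withColumn("yelping_since", to_date(col("yelping_since"))) \
    .withColumn("review_count", col("review_count").cast("int")) \
    .withColumn("useful", col("useful").cast("int")) \
    .withColumn("funny", col("funny").cast("int")) \
    .withColumn("cool", col("cool").cast("int")) \
    .withColumn("fans", col("fans").cast("int")) \
    .withColumn("average_stars", col("average_stars").cast("float"))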

Reading in multiple JSON files works fine; this is just the one file that is giving me trouble.

The call

df.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode(mode) \
        .options(table=table, keyspace=keyspace) \
        .save()

produces this error:

Caused by: java.lang.NullPointerException: Parameter value cannot be null
    at shade.com.datastax.spark.connector.google.common.base.Preconditions.checkNotNull(Preconditions.java:226)
    at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:511)
    at com.datastax.driver.core.CodecRegistry.maybeCreateCodec(CodecRegistry.java:630)
    at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:538)
    at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:520)
    at com.datastax.driver.core.CodecRegistry.codecFor(CodecRegistry.java:470)
    at com.datastax.spark.connector.util.CodecRegistryUtil$.codecFor(CodecRegistryUtil.scala:8)
    at com.datastax.spark.connector.writer.BoundStatementBuilder.com$datastax$spark$connector$writer$BoundStatementBuilder$$bindColumnUnset(BoundStatementBuilder.scala:72)
    at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
    at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
    at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$bind$1.apply$mcVI$sp(BoundStatementBuilder.scala:106)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:101)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:233)
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:210)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
    at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
    at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:210)
    at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
    at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

I am completely at a loss as to what I am overlooking here; everything works fine with all the other files. Since the error message says Parameter value cannot be null, I suspect I am missing some kind of configuration, but then how come all my other DataFrames write to Cassandra just fine?
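
Since the message complains about a null parameter value, a quick way to check whether this particular file contains nulls that the other files do not would be something like the following sketch:

from pyspark.sql.functions import col, count, when

# Count the null values per column; columns with unusually high counts
# in this file would be the first suspects.
df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in df.columns
]).show()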

Thanks a lot for your help :-)

0 Answers:

No answers yet.