Cassandra-Spark connector IllegalArgumentException when saving after repartitionByCassandraReplica

Date: 2018-07-22 21:03:09

Tags: apache-spark cassandra save illegalargumentexception

I am working on a project that saves a Spark DataFrame to Cassandra, and I am getting an exception about an invalid row size (see below). I tried tracing through the connector code, and it looks like the row size (3 below) does not match the number of columns expected (which turns out to be 1). I am trying to follow the example in https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md, except that my customer has a couple of extra fields rather than just the ID mentioned in the example. I have searched but could not find a workaround.

I am using Spark 2.3.0 and spark-cassandra-connector:2.3.0-s_2.11.

A bit of background: I can save the DataFrame to Cassandra and it works, but it is very slow. So I wanted to see whether repartitionByCassandraReplica would make it faster. I have already tried various combinations of batch row size, concurrent writers, and so on (a rough sketch of those settings is below), but it is still slow. That is why I am trying repartitionByCassandraReplica before saving to the Cassandra table. If there is any other way to save a DataFrame to Cassandra faster, please let me know. (If I remove repartitionByCassandraReplica, the save to Cassandra works.)
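For reference, the write-tuning attempts look roughly like this. This is only a sketch: the host and the setting values are placeholders rather than the actual ones from my cluster, and the DataFrame save goes through the DataSource API, which is the path that works for me (just slowly).

// Sketch of the tuning knobs I experimented with; values are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-save")
  .config("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
  .config("spark.cassandra.output.batch.size.rows", "100")  // rows per unlogged batch
  .config("spark.cassandra.output.concurrent.writes", "10") // parallel batches per task
  .getOrCreate()

import spark.implicits._
val customers = Seq(("1", 1, 1), ("2", 2, 2)).toDF("customer_id", "order_id", "value")

// Plain DataFrame save (works, but slow in my case):
customers.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "customer"))
  .mode("append")
  .save()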

Here is my setup:

Cassandra table in the test keyspace:

create table customer ( customer_id text primary key, order_id int, value int);

Spark shell commands:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._

case class Customer(customer_id: String, order_id: Int, value: Int)
val customers = Seq(Customer("1", 1, 1), Customer("2", 2, 2)).toDF("customer_id", "order_id", "value")

val customersRdd = customers.rdd.repartitionByCassandraReplica("test", "customer")
customersRdd.saveToCassandra("test", "customer")

At this point I get this exception:

java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
  at scala.Predef$.require(Predef.scala:224)
  at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:23)
  at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
  at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:99)
  at com.datastax.spark.connector.rdd.partitioner.TokenGenerator.getPartitionKeyBufferFor(TokenGenerator.scala:38)
  at com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner.getPartition(ReplicaPartitioner.scala:70)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
18/07/18 10:27:51 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
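One variation I have been meaning to try, based on the pattern in the 2_loading.md doc linked above, is to repartition an RDD of the Customer case class instead of the DataFrame's RDD[Row], so the object mapper rather than SqlRowWriter extracts the partition key. This is only my own sketch adapted from the doc's single-field example; I have not confirmed whether it avoids the error:

// Sketch: repartition an RDD[Customer] instead of the DataFrame's RDD[Row].
// Assumes the Customer case class and test.customer table defined above; untested.
import com.datastax.spark.connector._
import spark.implicits._

val typedRdd = customers.as[Customer].rdd  // RDD[Customer], not RDD[Row]
val localized = typedRdd.repartitionByCassandraReplica("test", "customer")
localized.saveToCassandra("test", "customer")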

Any help is appreciated.

0 Answers:

No answers yet.