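I am saving a Spark DataFrame to Cassandra in a project and am getting an exception about an invalid row size (see below). I tried tracing the code in the connector, and it looks like the row size (3 below) does not match the number of columns (which turns out to be 1). I am trying to follow the example at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md, except that my customer has two fields in addition to the ID used in the example. I searched but did not find a solution.
I am using Spark 2.3.0 and spark-cassandra-connector:2.3.0-s_2.11.
Some background: I first tried saving the DataFrame to Cassandra directly, and that works. However, it is very slow, so I wanted to see whether repartitionByCassandraReplica would make it faster. I have tried various combinations of batch row size, concurrent writers, and so on for the DataFrame write, but it is still slow. That is why I am trying repartitionByCassandraReplica before saving to the Cassandra table. If there is another way to save a DataFrame to Cassandra faster, please let me know. (If I remove repartitionByCassandraReplica, the data is saved to Cassandra without problems.)
Here is my setup:
Cassandra table in keyspace test: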
create table customer (customer_id text primary key, order_id int, value int);
Spark shell commands:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
case class Customer(customer_id:String, order_id:Int, value:Int)
val customers = Seq(Customer("1",1,1),Customer("2",2,2)).toDF("customer_id","order_id","value")
val customersRdd = customers.rdd.repartitionByCassandraReplica("test","customer")
customersRdd.saveToCassandra("test","customer")
At this point I get an exception:
java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
at scala.Predef$.require(Predef.scala:224)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:23)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:99)
at com.datastax.spark.connector.rdd.partitioner.TokenGenerator.getPartitionKeyBufferFor(TokenGenerator.scala:38)
at com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner.getPartition(ReplicaPartitioner.scala:70)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/07/18 10:27:51 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
Thanks for your help.