Using the libraries spark-cassandra-connector_2.11.jar and spark-sql-2.4.1.jar.
I have a Cassandra table like this:
CREATE TABLE abc.company_vals(
    companyId int,
    companyName text,
    year int,
    quarter text,
    revenue int,
    PRIMARY KEY (companyId, year)
) WITH CLUSTERING ORDER BY (year DESC);
I'm trying to insert data into the table above using Spark structured streaming, as shown below:
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
    RowFactory.create(10002, "TCS", 2004, "Q4", 7800),
    RowFactory.create(10003, "GE", 2004, "Q4", 7800),
    RowFactory.create(10004, "Oracle", 2004, "Q4", 7800),
    RowFactory.create(10005, "epam", 2004, "Q4", 7800),
    RowFactory.create(10006, "Dhfl", 2004, "Q4", 7800),
    RowFactory.create(10007, "Infosys", 2004, "Q4", 7800)
);
StructType schema = new StructType()
    .add("companyId", DataTypes.IntegerType)
    .add("companyName", DataTypes.StringType)
    .add("year", DataTypes.IntegerType)
    .add("quarter", DataTypes.StringType)
    .add("revenue", DataTypes.IntegerType);
Dataset<Row> companyDf = sparkSession.createDataFrame(data, schema).toDF();
companyDf
    .write()
    .format("org.apache.spark.sql.cassandra")
    .option("table", "company_vals")
    .option("keyspace", "abc")
    .mode(SaveMode.Append)
    .save();
I changed the column order in the table (partition key, clustering key, then the remaining columns) and updated the StructType and input data accordingly... but I still get the same error.
I get the following error:
java.util.NoSuchElementException: Columns not found in table abc.company_vals: companyId, companyName
at com.datastax.spark.connector.SomeColumns.selectFrom(ColumnSelector.scala:44)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:385)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:35)
at org.apache.spark.sql.cassandra.CassandraSourceRelation.insert(CassandraSourceRelation.scala:76)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:86)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
What am I doing wrong here? How can I fix it?
Answer 0 (score: 3)
The problem is that the Spark connector uses case-sensitive names, while in CQL identifiers are case-insensitive unless the column names are put in double quotes. So you either need to declare the fields in the table as case-sensitive, i.e. "companyId" and "companyName", or use lower-case names in your Spark application.
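A sketch of both fixes follows; the table, keyspace, and column names are taken from the question. Option 1 recreates the table with double-quoted (case-sensitive) column names so they match the DataFrame schema exactly (note that dropping the table discards any existing data):

-- Recreate the table with case-sensitive identifiers
DROP TABLE abc.company_vals;
CREATE TABLE abc.company_vals(
    "companyId" int,
    "companyName" text,
    year int,
    quarter text,
    revenue int,
    PRIMARY KEY ("companyId", year)
) WITH CLUSTERING ORDER BY (year DESC);

Option 2 keeps the table definition as-is and lower-cases the names on the Spark side, since Cassandra stores unquoted identifiers in lower case:

// Assumed variant of the question's schema: column names lower-cased
// to match the unquoted (and therefore lower-cased) CQL identifiers.
StructType schema = new StructType()
    .add("companyid", DataTypes.IntegerType)
    .add("companyname", DataTypes.StringType)
    .add("year", DataTypes.IntegerType)
    .add("quarter", DataTypes.StringType)
    .add("revenue", DataTypes.IntegerType);

With either option the write in the question should then succeed, because the connector can resolve every DataFrame column to an existing table column.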