I am using Spark 2.0 and Scala 2.11.8.
I have a Cassandra ResultSet from a select query and I want to convert it into a Spark DataFrame or Dataset. How can I do that?
I have been trying to use this connector:
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-RC1"
and later this one:
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3"
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._

val sparkConf = new SparkConf().
  setAppName(appName).
  set("spark.cassandra.connection.host", "10.60.50.134").
  set("spark.cassandra.auth.username", "xyz").
  set("spark.cassandra.auth.password", "abc")

val spark = SparkSession.builder().config(sparkConf).getOrCreate()

// RDD API: read the table via the connector's cassandraTable enrichment
val rdd = spark.
  sparkContext.
  cassandraTable(keyspace = s"$keyspace", table = s"$table")

rdd.take(10).foreach(println)
Both versions fail with the following error:
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews()Ljava/util/Collection;
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:281)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:305)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:304)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:304)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:325)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:322)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:122)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:121)
at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:122)
Answer 0 (score: 1)
You appear to be using the pre-Dataset API of the Spark Cassandra Connector, even though the connector supports Datasets out of the box (it just requires a different way of loading data from Cassandra tables).
My recommendation is to rewrite/upgrade your code to use the Spark Cassandra Connector's Dataset-friendly API.
From Example Changing Cluster/Keyspace Level Properties:
val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load() // This Dataset will use a spark.cassandra.input.size of 128
Later, in Creating Datasets using Read Commands (emphasis mine):
The most programmatic way to create a Dataset is to invoke a read command on the SparkSession. This will build a DataFrameReader. Specify the format as org.apache.spark.sql.cassandra. You can then use options to give the Map[String,String] of options described above. Then finish by calling load to actually get the Dataset. This code is all lazy and will not actually load any data until an action is called.
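Putting those steps together for your case, a minimal sketch (the host, credentials, and keyspace/table variables are the placeholders from your question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName(appName)
  .config("spark.cassandra.connection.host", "10.60.50.134")
  .config("spark.cassandra.auth.username", "xyz")
  .config("spark.cassandra.auth.password", "abc")
  .getOrCreate()

// Build a DataFrameReader, point it at the Cassandra data source,
// and load the table as a DataFrame (i.e. Dataset[Row]).
val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> s"$keyspace", "table" -> s"$table"))
  .load()

// Still lazy at this point; the first action triggers the actual read.
df.show(10)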
There is also the org.apache.spark.sql.cassandra.CassandraSQLRow object, which appears to provide a conversion from com.datastax.driver.core.Row to org.apache.spark.sql.cassandra.CassandraSQLRow:
fromJavaDriverRow(row: com.datastax.driver.core.Row, metaData: CassandraRowMetadata): CassandraSQLRow
My experience with the Spark Cassandra Connector is limited, but I would recommend using the implicit conversions where necessary.
// bring all the implicit goodies from the Spark Cassandra Connector
import com.datastax.spark.connector._
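If you would rather not spell out the format string and option keys by hand, connector 2.0 also ships implicits in the org.apache.spark.sql.cassandra package that add a cassandraFormat shortcut to DataFrameReader. A small sketch under that assumption:

// Assumes the Dataset-friendly implicits from connector 2.0.
import org.apache.spark.sql.cassandra._

val words = spark
  .read
  .cassandraFormat("words", "test") // table, then keyspace
  .load()

words.show()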