How do I convert Iterable<com.datastax.driver.core.Row> to a Dataset?

Asked: 2017-06-08 06:09:05

Tags: apache-spark apache-spark-sql spark-cassandra-connector

I am using Spark 2.0 and Scala 2.11.8.

I have a Cassandra ResultSet from a select query and I want to convert it to a Spark DataFrame or Dataset. How can I do that?

I have been trying this connector:

"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-RC1"

and later this one:

"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3"

Code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._

val sparkConf = new SparkConf().
  setAppName(appName).
  set("spark.cassandra.connection.host", "10.60.50.134").
  set("spark.cassandra.auth.username", "xyz").
  set("spark.cassandra.auth.password", "abc")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val rdd = spark.
  sparkContext.
  cassandraTable(keyspace = s"$keyspace", table = s"$table")
rdd.take(10).foreach(println)

Both fail with the following error:

Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews()Ljava/util/Collection;
    at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:281)
    at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:305)
    at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:304)
    at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
    at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
    at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
    at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:304)
    at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:325)
    at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:322)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:122)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
    at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
    at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
    at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
    at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
    at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:121)
    at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:122)

1 Answer:

Answer 0 (score: 1)

It looks like you are using the pre-Dataset API of the Spark Cassandra Connector, even though the connector supports Datasets out of the box (it just requires a different way of loading data from a Cassandra table).

My suggestion is to rewrite/upgrade your code to use the Spark Cassandra Connector's Dataset-friendly API.

From Example Changing Cluster/Keyspace Level Properties:

val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map( "table" -> "words", "keyspace" -> "test"))
  .load() // This Dataset will use a spark.cassandra.input.size of 128
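
Since the question asks for a Dataset as well as a DataFrame, the untyped df above can also be turned into a typed Dataset with an encoder. A minimal sketch, assuming the test.words table has the word (text) and count (int) columns used in the connector's documentation:

case class WordCount(word: String, count: Int)

import spark.implicits._ // brings the Encoder[WordCount] into scope
// Typed view over the same Cassandra-backed data; column names must match the case class fields.
val ds: org.apache.spark.sql.Dataset[WordCount] = df.as[WordCount]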

And later in Creating Datasets using Read Commands (highlighting mine):

  The most programmatic way to create a Dataset is to invoke a read command on the SparkSession. This builds a DataFrameReader. Specify the format as org.apache.spark.sql.cassandra. You can then use options to provide a Map[String,String] of options as described above. Then finish by calling load to actually get a Dataset. This code is all lazy and does not actually load any data until an action is called.
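
To see that laziness in action: nothing is read from Cassandra until an action runs on the Dataset, for example:

// No data has been fetched yet; this action triggers the actual scan.
df.show(10)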

The org.apache.spark.sql.cassandra.CassandraSQLRow object also seems to offer a conversion from com.datastax.driver.core.Row to org.apache.spark.sql.cassandra.CassandraSQLRow:

fromJavaDriverRow(row: com.datastax.driver.core.Row, metaData: CassandraRowMetadata): CassandraSQLRow
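
That call needs a CassandraRowMetadata, though, which is awkward to construct by hand. If what you actually hold is a plain Iterable of driver Rows (say from ResultSet.all()), a simpler hand-rolled sketch is to map each Row onto a case class and build the Dataset directly. The word/count columns below are hypothetical, and the rows are collected on the driver first, so this only suits small result sets:

import scala.collection.JavaConverters._
import com.datastax.driver.core.{ResultSet, Row}
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical schema: a table with columns word (text) and count (int).
case class WordCount(word: String, count: Int)

def toDataset(resultSet: ResultSet, spark: SparkSession): Dataset[WordCount] = {
  import spark.implicits._
  // Materialize the driver rows on the Spark driver, then extract the columns.
  val rows = resultSet.all().asScala.map { row: Row =>
    WordCount(row.getString("word"), row.getInt("count"))
  }
  spark.createDataset(rows)
}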

I have limited experience with the Spark Cassandra Connector, and I would recommend using the implicit conversions where necessary.

// bring all the implicit goodies from the Spark Cassandra Connector
import com.datastax.spark.connector._
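
With those implicits in scope (plus org.apache.spark.sql.cassandra._ for the DataFrameReader helpers), the read command shown earlier can be shortened. A sketch, assuming the same hypothetical test.words table:

import org.apache.spark.sql.cassandra._

// cassandraFormat fills in the format and the table/keyspace options in one call.
val words = spark.read.cassandraFormat("words", "test").load()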