Spark CassandraTableScanRDD keyBy does not retain all columns

Date: 2017-09-22 17:54:17

Tags: apache-spark rdd spark-cassandra-connector

CASSANDRA_TABLE has (some_other_column, itemid) as primary key.

val cassandraRdd: CassandraTableScanRDD[CassandraRow] = sparkSession.sparkContext
  .cassandraTable(cassandraKeyspace, cassandraTable)

cassandraRdd.take(10).foreach(println)

string1 1 true
string2 2 true
string1 1 true

This cassandraRdd contains all the columns read from my Cassandra table.

After the keyBy operation, neither temp1 nor temp2 retains all the columns:

val temp1: CassandraTableScanRDD[((String), CassandraRow)] = cassandraRdd
  .select("itemid", "column2", "column3")
  .keyBy[(String)]("itemid")
val temp2: CassandraTableScanRDD[((String), CassandraRow)] = cassandraRdd
  .keyBy[(String)]("itemid")
temp1.take(10).foreach(println)
temp2.take(10).foreach(println)

How can I key by specific columns and still have the CassandraRow retain all the columns?

1 answer:

Answer 0 (score: 0)

To preserve partitioning and get the selected columns, I had to read the Cassandra rows as follows:

val cassandraRdd: CassandraTableScanRDD[((String, String), (String, String, String))] = {
  sparkSession.sparkContext
    .cassandraTable[(String, String, String)](cassandraKeyspace, cassandraTable)
    .select("some_other_column" as "_1", "itemid" as "_2", "column3" as "_3", "some_other_column", "itemid")
    .keyBy[(String, String)]("some_other_column", "itemid")
}
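For context, the pattern above works because `cassandraTable[(String, String, String)]` maps the first three selected columns (aliased `as "_1"`, `"_2"`, `"_3"`) onto the value tuple, while re-selecting `some_other_column` and `itemid` keeps them available for `keyBy`, which otherwise consumes them for the key. A minimal self-contained sketch, assuming a local Cassandra node, a hypothetical keyspace/table named `my_keyspace`/`my_table` with the question's column names, and the spark-cassandra-connector on the classpath (it will not run without a live cluster):

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object KeyBySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("keyBy-sketch")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    // Alias each value column onto a tuple slot, then re-select the
    // primary-key columns so keyBy can build the key without removing
    // them from the value side.
    val keyed = sc
      .cassandraTable[(String, String, String)]("my_keyspace", "my_table")
      .select("some_other_column" as "_1", "itemid" as "_2", "column3" as "_3",
              "some_other_column", "itemid")
      .keyBy[(String, String)]("some_other_column", "itemid")

    keyed.take(10).foreach(println)
    sc.stop()
  }
}
```

Keying on the full primary key (partition key plus clustering column) is also what lets the connector preserve the Cassandra-aware partitioning of the scan, so downstream joins or aggregations on that key can avoid a shuffle.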