Question

I have the following Cassandra table:

CREATE TABLE listener.snapshots_geohash 
(
    created_date text, -- date when the record arrived in the system
    geo_part text, -- first few characters of the geohash, used only for partitioning
    when timestamp, -- record creation date
    device_id text, -- id of the device that produced the JSON data (see snapshot column)
    snapshot text, -- JSON data, to be aggregated by Spark
    PRIMARY KEY ((created_date, geo_part), when, device_id)
)

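Every morning an aggregation application should load the previous day's data and aggregate the JSON data from the snapshot column. The aggregation groups the data by geohash, which is why part of the geohash was chosen as part of the partition key.

I know that loading data from Cassandra with joinWithCassandraTable is efficient, but for that I have to build an RDD of (created_date, geo_part) pairs. While I know the created_date value, I cannot enumerate the geo_part values, because each one is only part of a geohash and the values are not contiguous. So I somehow have to run select distinct created_date, geo_part from ks.snapshots and create an RDD from the result. The question is how to run this select with Spark 2.0.2 and cassandra-connector 2.0.0-M3, or is there perhaps another way?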

Answer 1

I found a way to get the Cassandra partition keys by running a CQL query with CassandraConnector:

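 import com.datastax.spark.connector.cql.CassandraConnector
 import scala.collection.JavaConverters._ // needed to treat the driver's java.util.List as a Scala collection

 val cassandraConnector = CassandraConnector(spark.sparkContext.getConf)
 // Fetch every distinct partition key while the session is open
 val distinctRows = cassandraConnector.withSessionDo { session =>
     session.execute(s"select distinct created_date, geo_part from ${keyspace}.$snapshots_table").all().asScala
 }.map(row => TableKeyM(row.getString("created_date"), row.getString("geo_part")))
  .filter(k => days.contains(k.created_date)) // keep only the requested days
 val data_x = spark.sparkContext.parallelize(distinctRows)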

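This table design has the following drawback: Cassandra does not allow adding a WHERE created_date = '...' clause to select distinct created_date, geo_part, so the full list of pairs has to be fetched and then filtered in the application.

An alternative solution might be to make the partition key enumerable. If the aggregation is done by hour, the partition key could be (created_date, hour), and the 24 hours can be listed in the application. If 24 partitions per day are not enough and the aggregation groups by geohash, you can keep the significant part of the geohash, but it should be converted into something countable, for example geoPart.hash() % desiredNumberOfSubpartitions.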
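As a rough illustration of that last idea, here is a minimal sketch in plain Scala. The object name, the helper names, and the bucket count of 16 are all assumptions made for the example; only created_date and geo_part come from the table in the question. It also assumes the table would be re-keyed on (created_date, bucket), which is not the schema shown above.

```scala
// Hypothetical helper: none of these names exist in the original schema or code.
object PartitionBuckets {
  // Assumed bucket count; it would need tuning to the actual data volume.
  val desiredNumberOfSubpartitions: Int = 16

  // Map a geohash prefix to a bucket in [0, desiredNumberOfSubpartitions).
  // Math.floorMod keeps the result non-negative even when hashCode is negative.
  def bucketOf(geoPart: String): Int =
    Math.floorMod(geoPart.hashCode, desiredNumberOfSubpartitions)

  // With a fixed bucket count, every partition key for a given day
  // can be listed in the application without querying Cassandra.
  def keysFor(createdDate: String): Seq[(String, Int)] =
    (0 until desiredNumberOfSubpartitions).map(bucket => (createdDate, bucket))
}
```

The writer would store bucketOf(geoPart) as the second partition key column, and the aggregation job could pass keysFor(day) to parallelize before calling joinWithCassandraTable. String.hashCode is specified by the Java language, so writers and readers on different JVMs should agree on the bucket.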

Answer 2

import com.datastax.spark.connector._ // provides sc.cassandraTable
val keys = sc.cassandraTable("listener","snapshots_geohash").select("created_date","geo_part").perPartitionLimit(1)

For a detailed explanation, see https://stackoverflow.com/a/56269424/17324.

Retrieving partition keys for use with joinWithCassandraTable
