I have a Cassandra table with three columns: devid, epoch, and dimension. For analytics over Spark, I want all data for a specific devid to go to the same node, irrespective of dimension and epoch, so that data locality is good and analytics on a single devid can avoid network data shuffling in Spark.
However, the amount of data for each devid is too large for a single partition to be efficient, so I cannot define a primary key like (devid, dimension, epoch). Instead I need to use a key like ((devid, dimension), epoch), which keeps partition sizes manageable. But this spreads a single devid's data across multiple nodes (and then Spark has to shuffle data over the network for analytics on a single devid).
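For concreteness, here is a minimal sketch of the two schema options, written in Scala against the DataStax Java driver 4.x. The keyspace name (ks), the table names, and the column types (text/bigint) are assumptions, since the question does not state them:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object SchemaSketch {
  def main(args: Array[String]): Unit = {
    // Connects to 127.0.0.1:9042 by default; point this at your own cluster.
    val session = CqlSession.builder().build()

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS ks
        |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // Option 1: devid alone as the partition key. All rows for a devid share
    // one partition (perfect locality), but the partition grows without bound.
    session.execute(
      """CREATE TABLE IF NOT EXISTS ks.readings_by_devid (
        |  devid     text,
        |  dimension text,
        |  epoch     bigint,
        |  PRIMARY KEY (devid, dimension, epoch)
        |)""".stripMargin)

    // Option 2: composite partition key (devid, dimension). Partitions stay
    // bounded, but a devid's rows now hash to different token ranges and nodes.
    session.execute(
      """CREATE TABLE IF NOT EXISTS ks.readings_by_devid_dim (
        |  devid     text,
        |  dimension text,
        |  epoch     bigint,
        |  PRIMARY KEY ((devid, dimension), epoch)
        |)""".stripMargin)

    session.close()
  }
}
```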
Can I create a custom partitioner that considers devid and ignores dimension in the key ((devid, dimension), epoch) when generating the partition token? Is it advisable to do so?
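To make the idea concrete, the following toy Scala sketch shows what a devid-only token function would do. It is purely conceptual: a real custom partitioner must implement org.apache.cassandra.dht.IPartitioner, be on every node's classpath, and be configured cluster-wide in cassandra.yaml (it applies to all tables, not just one), and the polynomial hash below is a stand-in, not Cassandra's Murmur3 variant:

```scala
import java.nio.charset.StandardCharsets

object DevidOnlyTokenSketch {
  // Toy 64-bit polynomial hash over the serialized key -- a stand-in for the
  // token a partitioner would compute, not Cassandra's actual Murmur3 hash.
  private def toyToken(keyBytes: Array[Byte]): Long =
    keyBytes.foldLeft(1125899906842597L)((h, b) => 31 * h + b)

  // What Murmur3Partitioner effectively does: hash the full composite key,
  // so (devid, dimA) and (devid, dimB) land on different token ranges.
  def fullKeyToken(devid: String, dimension: String): Long =
    toyToken((devid + ":" + dimension).getBytes(StandardCharsets.UTF_8))

  // The question's proposal: hash devid alone, so every (devid, *) partition
  // maps to the same token and therefore to the same replica set.
  def devidOnlyToken(devid: String, dimension: String): Long =
    toyToken(devid.getBytes(StandardCharsets.UTF_8)) // dimension ignored

  def main(args: Array[String]): Unit = {
    // Almost certainly false: different composite keys hash differently.
    println(fullKeyToken("dev-42", "temp") == fullKeyToken("dev-42", "humidity"))
    // Always true: the dimension never enters the hash.
    println(devidOnlyToken("dev-42", "temp") == devidOnlyToken("dev-42", "humidity"))
  }
}
```

Note that under a devid-only token the partitions stay small on disk, but every dimension of a hot devid still maps to the same replica set, concentrating that devid's entire read/write load on a few nodes.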
Answer 0 (score: 0):
I'm not sure what you're trying to accomplish, but it sounds like you plan to have multiple partitions yet force them all to stay on the same node? Unless your replication factor is 1, you would be putting the data on multiple nodes anyway, so I don't see why you would want to do this.
Have you looked at the Spark Cassandra Connector, or something else along those lines?
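For reference, here is a minimal sketch of reading the ((devid, dimension), epoch) table with the Spark Cassandra Connector (com.datastax.spark:spark-cassandra-connector) from Scala; the keyspace/table names (ks, readings) and the contact point are placeholders. The connector builds Spark partitions from Cassandra token ranges and prefers executors on replica nodes, so the initial scan is node-local; the shuffle only appears when rows of one devid, spread across (devid, dimension) partitions, are brought together:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object DevidAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("devid-analytics")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
    val sc = new SparkContext(conf)

    // The connector splits the scan by Cassandra token range and schedules
    // each Spark partition on a node that owns a replica of that range, so
    // this read stays node-local even though one devid spans many partitions.
    val rows = sc.cassandraTable("ks", "readings")

    // Grouping by devid is the one unavoidable shuffle under this schema:
    // a devid's rows live in different (devid, dimension) partitions, and
    // therefore on different nodes.
    val epochsPerDevid = rows
      .map(r => (r.getString("devid"), r.getLong("epoch")))
      .groupByKey()

    epochsPerDevid.take(10).foreach(println)
    sc.stop()
  }
}
```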