Custom partitioner in Cassandra

Date: 2019-02-15 15:02:05

Tags: cassandra

I have a Cassandra table with three columns: devid, epoch, dimension. For analytics over Spark, I want all data for a specific devid to land on the same node, irrespective of dimension and epoch, so that there is good data locality and analytics on a single devid can avoid network data shuffling in Spark.

However, the amount of data for each devid would be far too large to be efficient in a single partition, so I cannot define a primary key like (devid, dimension, epoch). Instead I need a key like ((devid, dimension), epoch), which keeps partition sizes manageable. But this starts placing data for a single devid on multiple nodes (and then Spark has to shuffle data over the network for analytics on a single devid).
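A small sketch of the problem in plain Python (using `hashlib` as a stand-in for Cassandra's Murmur3 token function; the modulo node mapping and the device/dimension names are hypothetical, just to show the effect):

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def token(partition_key: str) -> int:
    # Stand-in for Cassandra's Murmur3 token; any uniform hash shows the effect.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def node_for(partition_key: str) -> int:
    # Simplified node placement: real Cassandra assigns token ranges, not modulo.
    return token(partition_key) % NUM_NODES

# With ((devid, dimension), epoch), each (devid, dimension) pair is its own
# partition and gets its own token, so one devid likely spans several nodes:
spread = {node_for(f"dev42:{dim}") for dim in ["temp", "humidity", "pressure", "co2"]}
print(spread)

# With devid alone as the partition key, every row of dev42 hashes to one node,
# but then the whole device is a single (huge) partition:
single = {node_for("dev42") for _ in range(4)}
print(single)  # a set with exactly one node
```

This is the tension in the question: partition granularity (small partitions per dimension) versus placement (all of one devid on one node).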

Can I create a custom partitioner that considers devid and ignores dimension in the key ((devid, dimension), epoch) when generating the partition token? Is it advisable to do so?
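Conceptually, such a partitioner would derive the token from the devid component alone. A minimal Python sketch of that idea (hypothetical: a real Cassandra partitioner implements `org.apache.cassandra.dht.IPartitioner` in Java, must be deployed cluster-wide, and applies to every table; `hashlib` again stands in for Murmur3):

```python
import hashlib

def devid_only_token(devid: str, dimension: str) -> int:
    # Deliberately ignore the dimension: the token depends on devid alone,
    # so all (devid, dimension) partitions of one device share a token
    # and therefore the same replica set.
    return int(hashlib.md5(devid.encode()).hexdigest(), 16)

# Different dimensions of dev42 stay separate partitions but co-locate:
t1 = devid_only_token("dev42", "temp")
t2 = devid_only_token("dev42", "humidity")
print(t1 == t2)  # True
```

Note the trade-off this sketch makes visible: every partition of one devid maps to the same token, so the cluster can no longer balance a hot device across nodes.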

1 answer:

Answer 0: (score: 0)

I'm not sure what you're trying to do, but it sounds like you intend to have multiple partitions while forcing them all to reside on the same node?... Unless your replication factor is 1, the data will be stored on multiple nodes anyway, and I'm not sure why you would want to force it onto one.

Have you looked at the Spark Cassandra Connector, or something similar?

This may also be useful: https://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/TokenAwarePolicy.html