I have a Cassandra table with three columns: devid, epoch, and dimension. For analytics over Spark, I want all data for a specific devid to go to the same node, irrespective of dimension and epoch, so that data locality is good and analytics on a single devid can avoid network data shuffling in Spark.
However, the amount of data for each devid is too large for a single partition to be efficient, so I cannot define a primary key like (devid, dimension, epoch). Instead I need to use a key like ((devid, dimension), epoch), which keeps partition sizes manageable. But this spreads a single devid's data across multiple nodes (and then Spark has to shuffle data over the network for analytics on a single devid).
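For concreteness, here is a minimal sketch of the two schema options, written in Scala against the DataStax Java driver 4.x. The keyspace name (ks), the table names, and the column types (text/bigint) are assumptions, since the question does not state them:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object SchemaSketch {
  def main(args: Array[String]): Unit = {
    // Connects to 127.0.0.1:9042 by default; point this at your own cluster.
    val session = CqlSession.builder().build()

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS ks
        |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // Option 1: devid alone as the partition key. All rows for a devid share
    // one partition (perfect locality), but the partition grows without bound.
    session.execute(
      """CREATE TABLE IF NOT EXISTS ks.readings_by_devid (
        |  devid     text,
        |  dimension text,
        |  epoch     bigint,
        |  PRIMARY KEY (devid, dimension, epoch)
        |)""".stripMargin)

    // Option 2: composite partition key (devid, dimension). Partitions stay
    // bounded, but a devid's rows now hash to different token ranges and nodes.
    session.execute(
      """CREATE TABLE IF NOT EXISTS ks.readings_by_devid_dim (
        |  devid     text,
        |  dimension text,
        |  epoch     bigint,
        |  PRIMARY KEY ((devid, dimension), epoch)
        |)""".stripMargin)

    session.close()
  }
}
```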
Can I create a custom partitioner that considers devid and ignores dimension in the key ((devid, dimension), epoch) when generating the partition token? Is it advisable to do so?
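To make the idea concrete, the following toy Scala sketch shows what a devid-only token function would do. It is purely conceptual: a real custom partitioner must implement org.apache.cassandra.dht.IPartitioner, be on every node's classpath, and be configured cluster-wide in cassandra.yaml (it applies to all tables, not just one), and the polynomial hash below is a stand-in, not Cassandra's Murmur3 variant:

```scala
import java.nio.charset.StandardCharsets

object DevidOnlyTokenSketch {
  // Toy 64-bit polynomial hash over the serialized key -- a stand-in for the
  // token a partitioner would compute, not Cassandra's actual Murmur3 hash.
  private def toyToken(keyBytes: Array[Byte]): Long =
    keyBytes.foldLeft(1125899906842597L)((h, b) => 31 * h + b)

  // What Murmur3Partitioner effectively does: hash the full composite key,
  // so (devid, dimA) and (devid, dimB) land on different token ranges.
  def fullKeyToken(devid: String, dimension: String): Long =
    toyToken((devid + ":" + dimension).getBytes(StandardCharsets.UTF_8))

  // The question's proposal: hash devid alone, so every (devid, *) partition
  // maps to the same token and therefore to the same replica set.
  def devidOnlyToken(devid: String, dimension: String): Long =
    toyToken(devid.getBytes(StandardCharsets.UTF_8)) // dimension ignored

  def main(args: Array[String]): Unit = {
    // Almost certainly false: different composite keys hash differently.
    println(fullKeyToken("dev-42", "temp") == fullKeyToken("dev-42", "humidity"))
    // Always true: the dimension never enters the hash.
    println(devidOnlyToken("dev-42", "temp") == devidOnlyToken("dev-42", "humidity"))
  }
}
```

Note that under a devid-only token the partitions stay small on disk, but every dimension of a hot devid still maps to the same replica set, concentrating that devid's entire read/write load on a few nodes.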
Answer 0 (score: 0):
I'm not sure what you're trying to accomplish, but it sounds like you plan to have multiple partitions yet force them all to stay on the same node? Unless your replication factor is 1, you would be putting the data on multiple nodes anyway, so I don't see why you would want to do this.
Have you looked at the Spark Cassandra Connector, or something else along those lines?
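For reference, here is a minimal sketch of reading the ((devid, dimension), epoch) table with the Spark Cassandra Connector (com.datastax.spark:spark-cassandra-connector) from Scala; the keyspace/table names (ks, readings) and the contact point are placeholders. The connector builds Spark partitions from Cassandra token ranges and prefers executors on replica nodes, so the initial scan is node-local; the shuffle only appears when rows of one devid, spread across (devid, dimension) partitions, are brought together:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object DevidAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("devid-analytics")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
    val sc = new SparkContext(conf)

    // The connector splits the scan by Cassandra token range and schedules
    // each Spark partition on a node that owns a replica of that range, so
    // this read stays node-local even though one devid spans many partitions.
    val rows = sc.cassandraTable("ks", "readings")

    // Grouping by devid is the one unavoidable shuffle under this schema:
    // a devid's rows live in different (devid, dimension) partitions, and
    // therefore on different nodes.
    val epochsPerDevid = rows
      .map(r => (r.getString("devid"), r.getLong("epoch")))
      .groupByKey()

    epochsPerDevid.take(10).foreach(println)
    sc.stop()
  }
}
```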