The Cassandra documentation says the following about size-tiered compaction (STCS):
While STCS works well to compact a write-intensive workload, it makes reads slower because the merge-by-size process does not group data by rows. This makes it more likely that versions of a particular row may be spread over many SSTables.
1) What does 'group data by rows' mean? Aren't all rows for a partition already grouped?
2) How is it possible for a row to have multiple versions on a single node? Doesn't the upsert behavior ensure that only the latest version of a row is accessible via the memtable and partition indices? Isn't it true that when a row is updated and the memtable is flushed, the partition indices are updated to point to the latest version? And then, on compaction, isn't this latest version (chosen by row timestamp) the one that ends up in the compacted SSTable?
Note that I'm talking about a single node here - NOT the issue of replicas being out of sync.
Either this is incorrect or I am misunderstanding what that paragraph says.
Thanks!
Answer 0 (score: 0)
OK, I think I found the answer myself - I would be grateful for any confirmation that this is correct.
A row may have many versions because updates/upserts can write only part of a row. Thus, the latest version of a complete row is made up of all the latest updates for all the columns in that row - which can be spread out across multiple SSTables.
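To make that concrete, here is a minimal sketch in Python of how a read could rebuild one row from partial updates spread over several SSTables. It is a toy model, not Cassandra's actual read path: the names `read_row`, `Cell` and `SSTable` are made up for illustration, each SSTable is just a dict, and tombstones, bloom filters and timestamp tie-breaking are ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a cell is (value, write_timestamp), and each
# SSTable maps a row key to the columns that were written in that flush.
Cell = Tuple[str, int]
SSTable = Dict[str, Dict[str, Cell]]


def read_row(key: str, sstables: List[SSTable]) -> Dict[str, str]:
    """Merge every fragment of `key`, keeping the newest cell per column."""
    merged: Dict[str, Cell] = {}
    for sstable in sstables:
        # An SSTable may hold only some columns of the row, or none at all.
        fragment = sstable.get(key, {})
        for column, (value, ts) in fragment.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    # Drop the timestamps once the newest cell per column is known.
    return {column: value for column, (value, _) in merged.items()}


# Three flushes, each containing only the columns written at that time.
sstables = [
    {"user:1": {"name": ("Ann", 100), "email": ("ann@old.example", 100)}},
    {"user:1": {"email": ("ann@new.example", 200)}},
    {"user:1": {"city": ("Oslo", 300)}},
]

row = read_row("user:1", sstables)
print(row)
```

In this model, the more SSTables hold a fragment of the row, the more of them a read has to touch, which is the read cost the documentation paragraph describes for STCS.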
My misunderstanding seems to stem from the idea that the partition indices can only point to one location in one SSTable. If I relax this constraint, the statement in the doc makes sense. I must therefore assume that the partition indices can hold multiple locations for a given primary key. Can someone confirm that all this is true?
Thanks.