I'm new to Cassandra and trying to figure out how sizing works. I created a keyspace and a table, then wrote a Java program that generates one million rows into a CSV file, which I then inserted into my database. The CSV file is about 545 MB. I loaded it into the database and ran the nodetool cfstats command. It reports a total space used of 50555052 bytes (~50 MB). How can that be? With the overhead of indexes, columns, and so on, how can my total data end up smaller than the raw CSV data (and not just smaller, but dramatically smaller)? Maybe I'm misreading something, but does that look right? I'm running Cassandra 2.2.1 on a single machine.
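For reference, the keyspace and table were the equivalent of the sketch below (the keyspace, table, column names, and types here are illustrative placeholders inferred from the CSV columns, not my exact schema, shown via the DataStax Java driver):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical schema sketch: "demo", "users", and all column names/types
// are placeholders reconstructed from the CSV layout.
public class CreateSchema {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                + "username text PRIMARY KEY, created timestamp, email text, "
                + "last_login timestamp, token1 text, token2 text, token3 text, flag text)");
        cluster.close();
    }
}

After the load, nodetool cfstats reported the following: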
Table: users
SSTable count: 1
Space used (live): 50555052
Space used (total): 50555052
Space used by snapshots (total): 0
Off heap memory used (total): 1481050
SSTable Compression Ratio: 0.03029072054256705
Number of keys (estimate): 984133
Memtable cell count: 240336
Memtable data size: 18385704
Memtable off heap memory used: 0
Memtable switch count: 19
Local read count: 0
Local read latency: NaN ms
Local write count: 1000000
Local write latency: 0.044 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1192632
Bloom filter off heap memory used: 1192624
Index summary off heap memory used: 203778
Compression metadata off heap memory used: 84648
Compacted partition minimum bytes: 643
Compacted partition maximum bytes: 770
Compacted partition mean bytes: 770
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
My Java code that generates the CSV file looks like this:
// Fragment of the generator method; sFileName is the method's parameter and
// date is a java.util.Date created once beforehand (the full class is shown
// further down).
try {
    FileWriter writer = new FileWriter(sFileName);
    for (int i = 0; i < 1000000; i++) {
        writer.append("Username " + i);
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("myfakeemailaccnt@email.com");
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("tr");
        writer.append('\n');
    }
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
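As a sanity check on the file size: each row is three 149-byte token strings plus roughly 95 bytes of username, timestamps, email, the "tr" flag, and separators, or about 540 bytes per row, so a million rows lands right at the ~545 MB I observed for the CSV.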
Answer 0 (score: 1)
So I looked at the three largest values:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

and realized that since they are identical in every row, Cassandra was probably compressing them. The SSTable Compression Ratio of 0.03 backs this up: that figure is compressed size divided by uncompressed size, so the data on disk is only about 3% of its raw size. So I changed my Java code to generate distinct data.
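To convince myself the repetition was the whole story, here is a quick standalone check. It is only a sketch: it uses the JDK's built-in DEFLATE compressor rather than the LZ4 compressor Cassandra 2.2 uses by default, so the exact numbers will differ, but the effect is the same: a megabyte of the repeated token shrinks to almost nothing, while a megabyte of random alphanumerics barely compresses.

import java.util.zip.Deflater;

// Illustrative only: DEFLATE is not Cassandra's compressor, but both
// exploit repeated byte sequences in the same way.
public class CompressionCheck {
    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final String TOKEN = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ";

    public static void main(String[] args) {
        StringBuilder repeated = new StringBuilder();
        StringBuilder random = new StringBuilder();
        // Build ~1 MB of each kind of payload.
        while (repeated.length() < 1000000) {
            repeated.append(TOKEN);
            for (int i = 0; i < TOKEN.length(); i++) {
                int c = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
                random.append(ALPHA_NUMERIC_STRING.charAt(c));
            }
        }
        System.out.println("repeated: " + repeated.length() + " -> "
                + compressedSize(repeated.toString().getBytes()) + " bytes");
        System.out.println("random:   " + random.length() + " -> "
                + compressedSize(random.toString().getBytes()) + " bytes");
    }

    // Runs the input through DEFLATE and returns the compressed byte count.
    private static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }
}

(Random text drawn from 36 characters carries log2(36) ≈ 5.17 bits of entropy per 8-bit character, so no compressor can do much better than a ~0.65 ratio on it, which lines up with the 0.66 SSTable Compression Ratio in the second run below.)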
import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;

public class Main {
    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    // Builds a random string of the given length from ALPHA_NUMERIC_STRING.
    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        while (count-- != 0) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    public static void generateCassandraCSVData(String sFileName) {
        java.util.Date date = new java.util.Date();
        try {
            FileWriter writer = new FileWriter(sFileName);
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
                // generate whatever data you want
            }
            writer.flush();
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Now the data in those three big columns is a random string in every row, no longer identical. This is what it produces now:
Table: users
SSTable count: 4
Space used (live): 554671040
Space used (total): 554671040
Space used by snapshots (total): 0
Off heap memory used (total): 1886175
SSTable Compression Ratio: 0.6615549506522498
Number of keys (estimate): 1019477
Memtable cell count: 270024
Memtable data size: 20758095
Memtable off heap memory used: 0
Memtable switch count: 25
Local read count: 0
Local read latency: NaN ms
Local write count: 1323546
Local write latency: 0.048 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1533512
Bloom filter off heap memory used: 1533480
Index summary off heap memory used: 257175
Compression metadata off heap memory used: 95520
Compacted partition minimum bytes: 311
Compacted partition maximum bytes: 770
Compacted partition mean bytes: 686
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
So now the CSV file is again ~550 MB, and my table is also ~550 MB. It seems that when non-key column data is identical across rows (low cardinality), Cassandra compresses it away very efficiently. If that's the case, it's an important concept to keep in mind when modeling a database (and one I had never read about before), because it can save you an enormous amount of storage.
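The two cfstats outputs bear this out if you invert the compression ratio. Assuming SSTable Compression Ratio is compressed size divided by uncompressed size, a back-of-the-envelope calculation gives:

// Back-of-the-envelope check using the numbers from the two cfstats outputs.
public class RatioCheck {
    public static void main(String[] args) {
        long onDisk1 = 50555052L;              // Space used (total), identical tokens
        double ratio1 = 0.03029072054256705;   // SSTable Compression Ratio
        long onDisk2 = 554671040L;             // Space used (total), random tokens
        double ratio2 = 0.6615549506522498;

        // ratio = compressed / uncompressed  =>  uncompressed = onDisk / ratio
        System.out.printf("identical tokens: ~%.0f MB uncompressed -> ~51 MB on disk%n",
                onDisk1 / ratio1 / 1e6);       // ~1669 MB
        System.out.printf("random tokens:    ~%.0f MB uncompressed -> ~555 MB on disk%n",
                onDisk2 / ratio2 / 1e6);       // ~838 MB
    }
}

Both uncompressed figures come out larger than the 545 MB CSV, which makes sense: Cassandra 2.2's storage engine writes the column name, a write timestamp, and other per-cell metadata alongside every value (and these are rough numbers, since some data was still sitting in the memtable when cfstats ran). The final on-disk size is then dominated by how compressible the values are: identical tokens compress almost completely away, while random ones barely shrink.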