Cassandra: ~500 MB CSV file produces a table of only ~50 MB?

Asked: 2015-09-20 02:35:32

Tags: cassandra

I'm new to Cassandra and trying to understand how storage sizing works. I created a keyspace and a table, then wrote a Java program that generates one million rows into a CSV file, which I then inserted into my database. The CSV file is about 545 MB. After loading the data I ran the nodetool cfstats command and got the output below. It reports a total space used of 50555052 bytes (~50 MB). How can that be? With the overhead of indexes, column metadata, and so on, how can my total stored data be smaller than the original CSV (not just smaller, but roughly a tenth the size)? Maybe I'm misreading the output, but does this look right? I'm running Cassandra 2.2.1 on a single machine.

Table: users
        SSTable count: 1
        Space used (live): 50555052
        Space used (total): 50555052
        Space used by snapshots (total): 0
        Off heap memory used (total): 1481050
        SSTable Compression Ratio: 0.03029072054256705
        Number of keys (estimate): 984133
        Memtable cell count: 240336
        Memtable data size: 18385704
        Memtable off heap memory used: 0
        Memtable switch count: 19
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1000000
        Local write latency: 0.044 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1192632
        Bloom filter off heap memory used: 1192624
        Index summary off heap memory used: 203778
        Compression metadata off heap memory used: 84648
        Compacted partition minimum bytes: 643
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 770
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

The Java code that generates the CSV file looks like this:

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.Date;

// Fragment from the CSV-generation method; sFileName is the target CSV path.
Date date = new Date();

try {
    FileWriter writer = new FileWriter(sFileName);
    for (int i = 0; i < 1000000; i++) {
        writer.append("Username " + i);
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("myfakeemailaccnt@email.com");
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        // The same hard-coded JWT-style string is written into three columns
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("tr");
        writer.append('\n');
    }
    writer.flush();
    writer.close();

} catch (IOException e) {
    e.printStackTrace();
}
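The table definition and the load step aren't shown above. For reference, here is a minimal sketch of how such a CSV could be pushed into Cassandra with the DataStax Java driver; the keyspace, table, and column names are assumptions, not the original schema (a cqlsh COPY FROM would work just as well):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.sql.Timestamp;

public class CsvLoader {

    public static void main(String[] args) throws IOException {
        // Contact point and keyspace name are assumptions.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");

        // Column names are guesses based on the CSV layout generated above.
        PreparedStatement insert = session.prepare(
                "INSERT INTO users (username, created, email, last_login, token1, token2, token3, flag) "
              + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)");

        try (BufferedReader reader = new BufferedReader(new FileReader("users.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] c = line.split(",");
                session.execute(insert.bind(
                        c[0],
                        Timestamp.valueOf(c[1]),   // Timestamp extends java.util.Date, which the driver maps to a CQL timestamp
                        c[2],
                        Timestamp.valueOf(c[3]),
                        c[4], c[5], c[6],
                        c[7]));
            }
        }

        cluster.close();
    }
}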

1 Answer:

Answer (score: 1)

So I looked at the three largest values:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ
and realized they were all identical, so perhaps Cassandra was compressing them, even though the stats report a compression ratio of only 0.03. (That ratio is compressed size divided by uncompressed size, so 0.03 actually means the data compressed down to about 3% of its original size.) So I changed my Java code to generate distinct data instead.

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.Date;

public class Main {

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    // Builds a random alphanumeric string of the requested length.
    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        while (count-- != 0) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    public static void generateCassandraCSVData(String sFileName) {

        Date date = new Date();

        try {
            FileWriter writer = new FileWriter(sFileName);
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                // The three large columns now get distinct random strings
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
            }
            writer.flush();
            writer.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Now the three large columns contain random strings instead of identical values. Here's what that produces:

Table: users
        SSTable count: 4
        Space used (live): 554671040
        Space used (total): 554671040
        Space used by snapshots (total): 0
        Off heap memory used (total): 1886175
        SSTable Compression Ratio: 0.6615549506522498
        Number of keys (estimate): 1019477
        Memtable cell count: 270024
        Memtable data size: 20758095
        Memtable off heap memory used: 0
        Memtable switch count: 25
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1323546
        Local write latency: 0.048 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1533512
        Bloom filter off heap memory used: 1533480
        Index summary off heap memory used: 257175
        Compression metadata off heap memory used: 95520
        Compacted partition minimum bytes: 311
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 686
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

Now the CSV file is again ~550 MB, and my table is ~550 MB as well (note the SSTable Compression Ratio went from 0.03 to 0.66). So it seems that when non-key column data is repetitive (low cardinality), Cassandra compresses it extremely effectively. If that's the case, it's a very important concept (one I had never read about) to keep in mind when modeling a database, because it can save you a huge amount of storage space.
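The effect comes from SSTable compression, which is enabled by default (LZ4 in Cassandra 2.2) and configured per table; highly repetitive values deflate to a few percent of their raw size, which is why the identical-JWT table used only ~50 MB on disk. A minimal sketch of adjusting the setting, assuming a keyspace/table named demo.users:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CompressionSettings {

    public static void main(String[] args) {
        // Contact point, keyspace, and table names are assumptions.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Cassandra 2.2 syntax: choose the compressor and chunk size per table.
            session.execute(
                    "ALTER TABLE demo.users WITH compression = "
                  + "{'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 64}");

            // Existing SSTables keep their old settings until rewritten (e.g. with
            // `nodetool upgradesstables -a demo users`); newly flushed data uses the new ones.
        }
    }
}

After a flush or compaction, `nodetool cfstats` reports the resulting SSTable Compression Ratio (compressed size / uncompressed size), which is the number that differed so dramatically between the two runs above.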