Question

I have a very large table in Cassandra (~500 million rows), and I want to export all rows for certain columns to a file. I tried this with the COPY command:

COPY keyspace.table (id, value) TO 'filepath' WITH DELIMITER=',';

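But the export took about 12 hours to complete. Is there any option that would make it faster?

If exporting only certain columns is a problem, exporting all the data would be fine too. The important thing is that I need a way to get all the entries so that I can process them afterwards.

Another question: is it possible to handle this export in PHP using the DataStax PHP driver?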

Answer 1

COPY ... TO ... is not a good idea for a large amount of data.

"Is it possible to handle this export in PHP using the DataStax PHP driver?"

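I did a CSV export from Cassandra with the help of the DataStax Java driver, but the same algorithm should apply in PHP. According to the driver documentation, you can easily execute a request and print the output. Take care with pagination.

You can convert each row array to a CSV line, for example with PHP's built-in fputcsv function.

So, the simplest example would be: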

<?php
$cluster   = Cassandra::cluster()                 // connects to localhost by default
                 ->build();
$keyspace  = 'system';
$session   = $cluster->connect($keyspace);        // create session, optionally scoped to a keyspace
$statement = new Cassandra\SimpleStatement(       // also supports prepared and batch statements
    'SELECT keyspace_name, columnfamily_name FROM schema_columnfamilies'
);
$future    = $session->executeAsync($statement);  // fully asynchronous and easy parallel execution
$result    = $future->get();                      // wait for the result, with an optional timeout

// Here you can print CSV headers.

foreach ($result as $row) {                       // results and rows implement Iterator, Countable and ArrayAccess
    // Here you can print CSV values  
    // printf("The keyspace %s has a table called %s\n", $row['keyspace_name'], $row['columnfamily_name']);
}
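
To make the pagination and CSV conversion points concrete, the export loop could look roughly like the sketch below. It is a minimal sketch under stated assumptions, not a tested implementation. The helper names writeCsvRows and exportTableToCsv, the page size of 5000, and the keyspace, table, and file names in the usage comment are hypothetical. The 'page_size' option, isLastPage() and nextPage() come from the DataStax PHP driver's paging support, and fputcsv is PHP's built-in CSV writer.

```php
<?php
// Hypothetical helper: write rows (arrays keyed by column name) to an open handle as CSV.
// Returns the number of rows written.
function writeCsvRows($handle, array $columns, $rows)
{
    $written = 0;
    foreach ($rows as $row) {
        $fields = [];
        foreach ($columns as $column) {
            $value = $row[$column];
            // Assumption: non-scalar values are driver objects (e.g. Cassandra\Uuid)
            // that can be cast to string; collections would need custom handling.
            $fields[] = (is_scalar($value) || $value === null) ? $value : (string) $value;
        }
        fputcsv($handle, $fields, ',', '"', '\\');
        $written++;
    }
    return $written;
}

// Hypothetical helper: page through a query result and append every page to a CSV file.
function exportTableToCsv($session, $cql, array $columns, $path, $pageSize = 5000)
{
    $handle = fopen($path, 'w');
    fputcsv($handle, $columns, ',', '"', '\\');            // header line
    $statement = new Cassandra\SimpleStatement($cql);
    // Recent driver versions accept an options array; older ones expect
    // new Cassandra\ExecutionOptions(['page_size' => $pageSize]) instead.
    $rows  = $session->execute($statement, ['page_size' => $pageSize]);
    $total = 0;
    while (true) {
        $total += writeCsvRows($handle, $columns, $rows);  // writes the current page only
        if ($rows->isLastPage()) {
            break;
        }
        $rows = $rows->nextPage();                         // fetch the next page synchronously
    }
    fclose($handle);
    return $total;
}

// Usage (hypothetical names):
// $session = Cassandra::cluster()->build()->connect('my_keyspace');
// exportTableToCsv($session, 'SELECT id, value FROM my_table', ['id', 'value'], 'export.csv');
```

Because only one page of rows is held at a time, memory use should stay roughly constant regardless of table size; the page size is a tuning knob, not a recommendation.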

Answer 2

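The short answer is yes, there are faster ways to do this.

The how is a longer answer: if you are going to save these rows to a file on a regular basis, you might want to use Apache Spark. Depending on how much memory your Cassandra nodes have, you can bring a simple 500 million row table scan => write to file down to < 1 hour.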

Export a complete table from Cassandra to CSV
