I want to archive data from Cassandra to Amazon S3, and I am using the DataStax Spark Cassandra Connector for this. I query Cassandra through a CassandraSQLContext, as shown in the code below. I have also included the mean partition size of my Cassandra column family, which seems fine. After reading several links on SO, I learned about increasing spark.cassandra.input.split.size_in_mb. But the job still takes a very long time to read the data and finally store it on Amazon S3. Am I doing something wrong here?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Connect to the Cassandra cluster and raise the input split size
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST_IPS")
  .set("spark.cassandra.input.split.size_in_mb", "128")
  .setAppName("ExportCassandraDatajob")
val sc = new SparkContext(conf)

// S3 credentials and the filesystem implementation behind the s3:// scheme
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESSKEY")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "SECRETKEY")
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Query the column family and write the result to S3, gzip-compressed
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KEYSPACE_NAME")
val dFArchive = csc.sql("select * from SOME_COLUMN_FAMILY where partitionkey in ('key1', 'key2' ...so on)")
dFArchive.rdd.saveAsTextFile("S3_PATH", classOf[org.apache.hadoop.io.compress.GzipCodec])
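
To check whether the split size setting is actually taking effect, I also looked at how many Spark partitions the read produces. This is only a minimal diagnostic sketch reusing dFArchive from the code above; the repartition count of 32 is just an illustrative value, not something from my real job:

val archiveRdd = dFArchive.rdd
// If this prints a small number, the read is barely parallelized regardless of cluster size
println(s"Spark partitions produced by the read: ${archiveRdd.partitions.length}")
// If it is small, repartitioning before the write spreads the S3 upload
// across more tasks (32 is only an illustrative value)
archiveRdd.repartition(32).saveAsTextFile("S3_PATH", classOf[org.apache.hadoop.io.compress.GzipCodec])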
Partition size in the Cassandra column family:
Compacted partition mean bytes: 173509526
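
For reference, 173509526 bytes / (1024 × 1024) ≈ 165 MB, which is already larger than the 128 MB split size I set above. If my understanding of the connector is right, a single Cassandra partition cannot be split across Spark tasks, so each task would be reading at least one whole ~165 MB partition.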