I'm new to Spark. I'm trying to fetch a large dataset (4 million records) from Cassandra through Spark in Java and group it by a retrieval key, but reading the data takes a very long time (50 minutes): the job creates 76 partitions and each partition takes about 30 seconds. I'd like this retrieval to be much faster, so any suggestions on this code would be appreciated.
Thanks in advance. My dependencies are:
----------
compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.4.1'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.4.0'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.4.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.4.0'
testCompile group: 'org.apache.spark', name: 'spark-catalyst_2.11', version: '2.4.0'
----------
My code is:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import com.datastax.spark.connector.japi.SparkContextJavaFunctions;
import com.google.common.collect.Lists;
import scala.Tuple2;

// configuration
SparkConf conf = new SparkConf();
conf.setAppName("TODO spark and cassandra");
conf.setMaster("local");
conf.set("spark.cassandra.connection.host", "<host no>");
conf.set("spark.cassandra.connection.port", "9090");
conf.set("spark.cassandra.auth.username", "<user>");
conf.set("spark.cassandra.auth.password", "password");
conf.set("spark.ui.enabled", "true");
conf.set("spark.testing.memory", "2147480000");
// note: the split size property is measured in MB
conf.set("spark.cassandra.input.split.size_in_mb", "67108864");

JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions functions = CassandraJavaUtil.javaFunctions(sc);
JavaRDD<CassandraRow> rdd = functions.cassandraTable("<keyspacename>", "<table name>");

// group the rows by visitationpointtype, then count the rows in each group
JavaPairRDD<String, Integer> sizes = rdd
    .groupBy(new Function<CassandraRow, String>() {
        private static final long serialVersionUID = 1L;

        @Override
        public String call(CassandraRow row) throws Exception {
            return row.getString("visitationpointtype");
        }
    })
    .mapToPair(new PairFunction<Tuple2<String, Iterable<CassandraRow>>, String, Integer>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Tuple2<String, Integer> call(Tuple2<String, Iterable<CassandraRow>> t) throws Exception {
            return new Tuple2<String, Integer>(t._1(), Lists.newArrayList(t._2()).size());
        }
    });
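For context on what the last step is doing: the pairing above first materializes every group into a list just to take its size. The same per-key counting can be expressed by incrementing a counter per key instead (in Spark terms, a `mapToPair` to `(key, 1)` followed by `reduceByKey`). Here is that counting idea in plain, self-contained Java, purely as an illustration of the intent, not the Spark API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyCounts {
    // Count occurrences per key by incrementing a counter,
    // instead of collecting each group into a list and taking its size.
    static Map<String, Integer> countByKey(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countByKey(List.of("hotel", "museum", "hotel", "park", "hotel"));
        System.out.println(counts.get("hotel")); // prints 3
    }
}
```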