Apache Spark write takes hours

Date: 2018-12-18 06:46:16

Tags: apache-spark cassandra apache-spark-sql spark-cassandra-connector

I have a Spark job that performs a right join between two tables. Reading and joining are very fast, but when it tries to insert the join result into the Cassandra database it is extremely slow: inserting 1,000 rows takes more than 30 minutes, and inserting 9 records takes 3 minutes. Please see my configuration below. We have 3 Cassandra/Spark nodes, with Spark installed on every node. I am very new to Spark and cannot work out what is going wrong. With the DSE driver I can insert the same amount of data (more than 2,000 rows) in under a second. Thanks for your time and help!

spark-submit:

"dse -u " + username + " -p " + password + " spark-submit --class com.SparkJoin --executor-memory=20G  " +
                "SparkJoinJob-1.0-SNAPSHOT.jar " + filterMap.toString() + "

Spark Core version: 2.7.2

spark-cassandra-connector_2.11:2.3.1

spark-sql_2.11:2.3.1

Spark Conf

    SparkConf conf = new SparkConf(true).setAppName("Appname");
    conf.set("spark.cassandra.connection.host", host);
    conf.set("spark.cassandra.auth.username", username);
    conf.set("spark.cassandra.auth.password", password);

    conf.set("spark.network.timeout", "600s");
    conf.set("spark.cassandra.connection.keep_alive_ms", "25000");
    conf.set("spark.cassandra.connection.timeout_ms", "5000000");
    conf.set("spark.sql.broadcastTimeout", "5000000");
    SparkContext sc = new SparkContext(conf);

    SparkSession sparkSession = SparkSession.builder().sparkContext(sc).getOrCreate();
    SQLContext sqlContext = sparkSession.sqlContext();

    sqlContext.setConf("spark.cassandra.connection.host", host);
    sqlContext.setConf("spark.cassandra.auth.username", username);
    sqlContext.setConf("spark.cassandra.auth.password", password);
    sqlContext.setConf("spark.network.timeout", "600s");
    sqlContext.setConf("spark.cassandra.connection.keep_alive_ms", "2500000");
    sqlContext.setConf("spark.cassandra.connection.timeout_ms", "5000000");
    sqlContext.setConf("spark.sql.broadcastTimeout", "5000000");
    sqlContext.setConf("spark.executor.heartbeatInterval", "5000000");
    sqlContext.setConf("spark.sql.crossJoin.enabled", "true");
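
As a side note, the same connector settings can be supplied once through the SparkSession builder instead of being repeated on both SparkConf and the SQLContext. A minimal sketch, using the same host/username/password variables as above:

    // Settings passed to the builder are visible to both the RDD and Dataset APIs.
    SparkSession spark = SparkSession.builder()
            .appName("Appname")
            .config("spark.cassandra.connection.host", host)
            .config("spark.cassandra.auth.username", username)
            .config("spark.cassandra.auth.password", password)
            .config("spark.network.timeout", "600s")
            .config("spark.sql.crossJoin.enabled", "true")
            .getOrCreate();
    SQLContext sqlContext = spark.sqlContext();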

Fetch the left and right tables:

    Dataset<Row> resultsFrame = sqlContext.sql("select * from table where conditions");
    return resultsFrame.map((MapFunction<Row, JavaObject>) row -> {
        // some operations here

        return obj;
    }, Encoders.bean(JavaObject.class));
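
Both Encoders.bean and CassandraJavaUtil.mapToRow (used for the write below) expect JavaObject to be a plain JavaBean: a public no-argument constructor plus getters/setters whose names line up with the target table's columns. The actual class is not shown in the question; a purely hypothetical sketch of its shape:

    // Hypothetical bean; the real fields must match the Cassandra table's columns.
    public class JavaObject implements java.io.Serializable {
        private java.math.BigDecimal col1;
        private String col5;

        public JavaObject() {}  // no-arg constructor required by Encoders.bean

        public java.math.BigDecimal getCol1() { return col1; }
        public void setCol1(java.math.BigDecimal col1) { this.col1 = col1; }

        public String getCol5() { return col5; }
        public void setCol5(String col5) { this.col5 = col5; }
    }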

Join

    Dataset<Row> result = RigtTableJavaRDD.join(LeftTableJavaRDD,
            (LeftTableJavaRDD.col("col1").minus(RigtTableJavaRDD.col("col2")))
                    .between(new BigDecimal("0").subtract(twoHundredMilliseconds),
                             new BigDecimal("0").add(twoHundredMilliseconds))
                    .and(LeftTableJavaRDD.col("col5").equalTo(RigtTableJavaRDD.col("col6"))),
            "right");
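
For clarity, the between/minus expression above keeps rows where col1 and col2 differ by at most twoHundredMilliseconds, combined with the col5 = col6 equality. Assuming both columns are numeric, an equivalent way to state the same predicate is:

    // abs(col1 - col2) <= tolerance, plus the equality on col5/col6
    // (functions is org.apache.spark.sql.functions, Column is org.apache.spark.sql.Column)
    Column withinWindow = functions.abs(LeftTableJavaRDD.col("col1").minus(RigtTableJavaRDD.col("col2")))
            .leq(twoHundredMilliseconds)
            .and(LeftTableJavaRDD.col("col5").equalTo(RigtTableJavaRDD.col("col6")));

    Dataset<Row> result = RigtTableJavaRDD.join(LeftTableJavaRDD, withinWindow, "right");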

Insert the result

    CassandraJavaUtil.javaFunctions(resultRDD.javaRDD())
            .writerBuilder("keyspace", "table", CassandraJavaUtil.mapToRow(JavaObject.class))
            .saveToCassandra();
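
Since the join result is already a Dataset, the same write can also be expressed through the connector's DataFrame source instead of converting to an RDD first. A minimal sketch, reusing the keyspace/table names from the snippet above:

    resultRDD.write()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "keyspace")
            .option("table", "table")
            .mode(SaveMode.Append)  // org.apache.spark.sql.SaveMode
            .save();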

0 Answers:

No answers yet.