This code aggregates new CSV files with the existing data in MongoDB; both the existing and the new records are then saved back into MongoDB. The whole process currently takes about 7-8 hours to complete, and I would like to make it faster.
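For context, here is roughly how the two inputs are prepared before the snippet below. The CSV path and read options are placeholders rather than my real values, and I am assuming the standard MongoSpark.load overload that takes a SparkSession and a ReadConfig:

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.ReadConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// New data: read the incoming CSV files (path and options are placeholders).
Dataset<Row> csvSchemaTable = sparkSession.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/path/to/incoming/*.csv");
csvSchemaTable.createOrReplaceTempView("FinalCsvSchemaTable");

// Existing data: load the current MongoDB collection through the Spark connector.
Dataset<Row> slaveSchemaTable = MongoSpark.load(sparkSession, readConfig);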
if (slaveSchemaTable.limit(1).count() == 0) {
// Write the new CSV data straight into MongoDB when the collection is empty; no checking or validation needed
MongoSpark.save(csvSchemaTable.write().mode("overwrite"), writeConfig);
} else {
// Aggregate the new CSV data with the existing MongoDB data; both existing and new records will be saved into MongoDB
slaveSchemaTable.createOrReplaceTempView("SlaveSchemaTable");
// slaveSchemaTable.show();
StringBuilder joinBuilder = new StringBuilder("SELECT b._id as _id,");
joinBuilder.append("a.msisdn as msisdn,");
joinBuilder.append("a.classification as classification,");
joinBuilder.append("a.event_date_actual as event_date_actual,");
joinBuilder.append("(IFNULL(a.up_vol_mb,0), IFNULL(b.up_vol_mb,0)) as up_vol_mb,");
joinBuilder.append("(IFNULL(a.down_vol_mb,0) + IFNULL(b.down_vol_mb,0)) as down_vol_mb,");
joinBuilder.append("(IFNULL(a.total_vol_mb,0) + IFNULL(b.total_vol_mb,0)) as total_vol_mb ");
joinBuilder.append("FROM FinalCsvSchemaTable a LEFT JOIN SlaveSchemaTable b ");
joinBuilder.append("ON (a.msisdn = b.msisdn) AND ");
joinBuilder.append("(a.classification = b.classification) AND ");
joinBuilder.append("(a.event_date_actual = b.event_date_actual) ");
Dataset<Row> slaveTable = sparkSession.sql(joinBuilder.toString());
MongoSpark.save(slaveTable.write().mode("append"), writeConfig);
}
log.debug("End Slave Aggregation Batch");
return true;
The else branch of the code above is the slow part. I expect the speed can be improved so that processing even a large set of files takes at most 20-30 minutes.
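One direction I have been considering, but have not verified, is to co-partition both sides on the three join keys and cache the MongoDB snapshot before running the SQL, so the big shuffle happens only once. A sketch, reusing the variable names from above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;
import static org.apache.spark.sql.functions.col;

// Repartition both sides on the join keys and cache the existing data,
// so the join does not re-read and re-shuffle the full MongoDB snapshot.
Dataset<Row> slaveKeyed = slaveSchemaTable
        .repartition(col("msisdn"), col("classification"), col("event_date_actual"))
        .persist(StorageLevel.MEMORY_AND_DISK());
Dataset<Row> csvKeyed = csvSchemaTable
        .repartition(col("msisdn"), col("classification"), col("event_date_actual"));

slaveKeyed.createOrReplaceTempView("SlaveSchemaTable");
csvKeyed.createOrReplaceTempView("FinalCsvSchemaTable");
// ...then run the same SELECT ... LEFT JOIN ... query as above.

Is this the right lever here, or is the bottleneck more likely in the MongoDB read or write itself?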