This code aggregates new CSV files with the existing data in MongoDB; both the existing and the new records are then saved back into MongoDB. The whole process currently takes about 7-8 hours to complete, and I would like to make it faster.
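For context, here is roughly how the two inputs are prepared before the snippet below. The CSV path and read options are placeholders rather than my real values, and I am assuming the standard MongoSpark.load overload that takes a SparkSession and a ReadConfig:

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.ReadConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// New data: read the incoming CSV files (path and options are placeholders).
Dataset<Row> csvSchemaTable = sparkSession.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/path/to/incoming/*.csv");
csvSchemaTable.createOrReplaceTempView("FinalCsvSchemaTable");

// Existing data: load the current MongoDB collection through the Spark connector.
Dataset<Row> slaveSchemaTable = MongoSpark.load(sparkSession, readConfig);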
if (slaveSchemaTable.limit(1).count() == 0) {
// Write the new CSV data straight into MongoDB when the collection is empty; no checking or validation needed
MongoSpark.save(csvSchemaTable.write().mode("overwrite"), writeConfig);
} else {
// Aggregate the new CSV data with the existing MongoDB data; both existing and new records will be saved into MongoDB
slaveSchemaTable.createOrReplaceTempView("SlaveSchemaTable");
// slaveSchemaTable.show();
StringBuilder joinBuilder = new StringBuilder("SELECT b._id as _id,");
joinBuilder.append("a.msisdn as msisdn,");
joinBuilder.append("a.classification as classification,");
joinBuilder.append("a.event_date_actual as event_date_actual,");
joinBuilder.append("(IFNULL(a.up_vol_mb,0), IFNULL(b.up_vol_mb,0)) as up_vol_mb,");
joinBuilder.append("(IFNULL(a.down_vol_mb,0) + IFNULL(b.down_vol_mb,0)) as down_vol_mb,");
joinBuilder.append("(IFNULL(a.total_vol_mb,0) + IFNULL(b.total_vol_mb,0)) as total_vol_mb ");
joinBuilder.append("FROM FinalCsvSchemaTable a LEFT JOIN SlaveSchemaTable b ");
joinBuilder.append("ON (a.msisdn = b.msisdn) AND ");
joinBuilder.append("(a.classification = b.classification) AND ");
joinBuilder.append("(a.event_date_actual = b.event_date_actual) ");
Dataset<Row> slaveTable = sparkSession.sql(joinBuilder.toString());
MongoSpark.save(slaveTable.write().mode("append"), writeConfig);
}
log.debug("End Slave Aggregation Batch");
return true;
The else branch of the code above is the slow part. I expect the speed can be improved so that processing even a large set of files takes at most 20-30 minutes.
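One direction I have been considering, but have not verified, is to co-partition both sides on the three join keys and cache the MongoDB snapshot before running the SQL, so the big shuffle happens only once. A sketch, reusing the variable names from above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;
import static org.apache.spark.sql.functions.col;

// Repartition both sides on the join keys and cache the existing data,
// so the join does not re-read and re-shuffle the full MongoDB snapshot.
Dataset<Row> slaveKeyed = slaveSchemaTable
        .repartition(col("msisdn"), col("classification"), col("event_date_actual"))
        .persist(StorageLevel.MEMORY_AND_DISK());
Dataset<Row> csvKeyed = csvSchemaTable
        .repartition(col("msisdn"), col("classification"), col("event_date_actual"));

slaveKeyed.createOrReplaceTempView("SlaveSchemaTable");
csvKeyed.createOrReplaceTempView("FinalCsvSchemaTable");
// ...then run the same SELECT ... LEFT JOIN ... query as above.

Is this the right lever here, or is the bottleneck more likely in the MongoDB read or write itself?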