I am working in the following environment: Windows 10, Eclipse, cuda.version = 10.0, bytedecoPresets.version = 10.0-7.3-1.4.3.

When training a network on multiple GPUs with ParallelWrapper, saving the MultiLayerNetwork with LocalFileModelSaver is extremely slow. When using EarlyStoppingTrainer on a single GPU, saving the model takes about as long as expected. Is my configuration incorrect, or is this a bug in DL4J?

While the model was being saved, I stepped through the DL4J code with a debugger. With multiple GPUs and ParallelWrapper, the following stack trace is produced:
Thread [main] (Suspended (breakpoint at line 69 in cudaEvent_t))
	cudaEvent_t.synchronize() line: 69
	GridFlowController(SynchronousFlowController).waitTillFinished(AllocationPoint) line: 134
	GridFlowController.waitTillFinished(AllocationPoint) line: 63
	GridFlowController.synchronizeToHost(AllocationPoint) line: 47
	CudaZeroHandler.synchronizeThreadDevice(Long, Integer, AllocationPoint) line: 1304
	AtomicAllocator.synchronizeHostData(DataBuffer) line: 370
	CudaFloatDataBuffer(BaseCudaDataBuffer).getFloat(long) line: 1131
	CudaFloatDataBuffer(BaseDataBuffer).write(DataOutputStream) line: 1562
	CudaFloatDataBuffer(BaseCudaDataBuffer).write(DataOutputStream) line: 801
	Nd4j.write(INDArray, DataOutputStream) line: 2464
	ModelSerializer.writeModel(Model, OutputStream, boolean, DataNormalization) line: 156
	ModelSerializer.writeModel(Model, OutputStream, boolean) line: 119
	LenetNetworkTrainer.trainNetwork(boolean) line: 251
	ImageClassificationTrainer.run() line: 88
	ImageClassificationTrainer.main(String[]) line: 217
It appears that cudaEvent_t.synchronize() is called every time a single float value is retrieved while the CudaFloatDataBuffer is being written, and I believe this is the root of the problem.
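To illustrate why this matters, here is a plain-Java sketch (not DL4J code; the `syncToHost` call is a stand-in for `cudaEvent_t.synchronize()`, and all names are my own): writing the buffer element by element pays one synchronization per float, while synchronizing once and then streaming the host copy pays the cost a single time, even though both paths produce identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SyncCostSketch {
    static int syncCalls = 0;

    // Stand-in for cudaEvent_t.synchronize(): an expensive device-to-host sync.
    static void syncToHost() { syncCalls++; }

    // Mirrors the slow path: getFloat(i) synchronizes for every element.
    static byte[] writePerElement(float[] device) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            for (float v : device) {
                syncToHost();          // one sync per float written
                dos.writeFloat(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Mirrors the fast path: synchronize once, then stream the host copy.
    static byte[] writeBulk(float[] device) {
        syncToHost();                  // single sync for the whole buffer
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            for (float v : device) {
                dos.writeFloat(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        float[] params = new float[10_000];
        for (int i = 0; i < params.length; i++) {
            params[i] = i * 0.5f;
        }

        syncCalls = 0;
        byte[] slow = writePerElement(params);
        System.out.println(syncCalls);                        // 10000

        syncCalls = 0;
        byte[] fast = writeBulk(params);
        System.out.println(syncCalls);                        // 1

        System.out.println(java.util.Arrays.equals(slow, fast)); // true
    }
}
```

For a network with millions of parameters, the per-element path multiplies the synchronization cost by the parameter count, which would be consistent with a save that takes minutes instead of seconds.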
The equivalent stack trace when using a single GPU with EarlyStoppingTrainer is:
Thread [main] (Suspended)
	GridFlowController(SynchronousFlowController).synchronizeToHost(AllocationPoint) line: 97
	GridFlowController.synchronizeToHost(AllocationPoint) line: 50
	CudaZeroHandler.synchronizeThreadDevice(Long, Integer, AllocationPoint) line: 1304
	AtomicAllocator.synchronizeHostData(DataBuffer) line: 370
	CudaFloatDataBuffer(BaseCudaDataBuffer).getFloat(long) line: 1131
	CudaFloatDataBuffer(BaseDataBuffer).write(DataOutputStream) line: 1562
	CudaFloatDataBuffer(BaseCudaDataBuffer).write(DataOutputStream) line: 801
	Nd4j.write(INDArray, DataOutputStream) line: 2464
	ModelSerializer.writeModel(Model, OutputStream, boolean, DataNormalization) line: 156
	ModelSerializer.writeModel(Model, OutputStream, boolean) line: 119
	ModelSerializer.writeModel(Model, String, boolean) line: 106
	LocalFileModelSaver.save(MultiLayerNetwork, String) line: 99
	LocalFileModelSaver.saveBestModel(MultiLayerNetwork, double) line: 77
	LocalFileModelSaver.saveBestModel(Model, double) line: 42
	EarlyStoppingTrainer(BaseEarlyStoppingTrainer).fit() line: 223
	LenetNetworkTrainer.trainNetwork(boolean) line: 228
	ImageClassificationTrainer.run() line: 88
	ImageClassificationTrainer.main(String[]) line: 217
In this case, cudaEvent_t.synchronize() is not called for every float that is written. The relevant part of my training code is:
network = new MultiLayerNetwork(getConfiguration(
        trainingIterator.getLabels().size(), this.builder.getRandomSeed(),
        this.builder.getImageWidth(), this.builder.getImageHeight(),
        this.builder.getImageChannels(), this.builder.getEpochs(),
        iterators.getLeft().getRight(), this.builder.getBatchSize(),
        this.builder.getLearningRateInitialValue(),
        this.builder.getLearningRateDecayExponent()));
network.init();

ParallelWrapper wrapper = new ParallelWrapper.Builder<>(network)
        .prefetchBuffer(8)
        .workers(2)
        .averagingFrequency(3)
        .reportScoreAfterAveraging(true)
        .build();

LocalFileModelSaver localFileModelSaver = new LocalFileModelSaver(
        this.builder.getTrainingPath().getAbsolutePath());
DataSetLossCalculator dataSetLossCalculator =
        new DataSetLossCalculator(testingIterator, true);

for (int i = 0; i < this.builder.getEpochs(); i++) {
    wrapper.fit(trainingIterator);
}

//
// This takes about 25 minutes to save the model
//
localFileModelSaver.saveLatestModel(network, 0.0);
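The 25-minute figure above was obtained by simply timing the save call; a minimal plain-Java helper along these lines (hypothetical names, not a DL4J API) is enough to reproduce the measurement:

```java
import java.util.concurrent.TimeUnit;

public class SaveTimer {
    // Runs an action (e.g. the model-save call) and returns elapsed seconds.
    public static long timeSeconds(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // In the real code this wraps:
        //   localFileModelSaver.saveLatestModel(network, 0.0);
        long secs = timeSeconds(() -> {
            // placeholder standing in for the save call
        });
        System.out.println("save took " + secs + "s");
    }
}
```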
I would expect the model to be saved to disk in 15 to 30 seconds. Am I missing something?