Why is saving a MultiLayerNetwork model so slow when using ParallelWrapper with multiple GPUs?

Asked: 2019-04-25 18:03:02

Tags: deeplearning4j dl4j

I am working in the following environment: Windows 10, Eclipse, cuda.version = 10.0, bytedecoPresets.version = 10.0-7.3-1.4.3

When training a network with ParallelWrapper and multiple GPUs, saving the MultiLayerNetwork with LocalFileModelSaver is extremely slow. When using EarlyStoppingTrainer with a single GPU, saving the model takes about as long as expected. Is my configuration incorrect, or is this a bug in DL4J?

I used the debugger to step through the DL4J code while the model was being saved. With multiple GPUs and ParallelWrapper, this is the stack trace:

Thread [main] (suspended (breakpoint at line 69 in cudaEvent_t))
    cudaEvent_t.synchronize() line: 69
    GridFlowController(SynchronousFlowController).waitTillFinished(AllocationPoint) line: 134
    GridFlowController.waitTillFinished(AllocationPoint) line: 63
    GridFlowController.synchronizeToHost(AllocationPoint) line: 47
    CudaZeroHandler.synchronizeThreadDevice(Long, Integer, AllocationPoint) line: 1304
    AtomicAllocator.synchronizeHostData(DataBuffer) line: 370
    CudaFloatDataBuffer(BaseCudaDataBuffer).getFloat(long) line: 1131
    CudaFloatDataBuffer(BaseDataBuffer).write(DataOutputStream) line: 1562
    CudaFloatDataBuffer(BaseCudaDataBuffer).write(DataOutputStream) line: 801
    Nd4j.write(INDArray, DataOutputStream) line: 2464
    ModelSerializer.writeModel(Model, OutputStream, boolean, DataNormalization) line: 156
    ModelSerializer.writeModel(Model, OutputStream, boolean) line: 119
    LenetNetworkTrainer.trainNetwork(boolean) line: 251
    ImageClassificationTrainer.run() line: 88
    ImageClassificationTrainer.main(String[]) line: 217

It appears that cudaEvent_t.synchronize() is called every time a single float value is retrieved while writing the CudaFloatDataBuffer (I believe this is the root of the problem).
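A back-of-envelope estimate makes a per-float sync cost plausible as the culprit. Both numbers below are assumptions chosen purely for illustration (neither the parameter count of my network nor the cost of one event sync has been measured):

```java
// Back-of-envelope check (pure arithmetic, no DL4J/CUDA required) that a
// per-float device sync could plausibly account for a ~25-minute save.
// Assumptions: ~30M parameters for the network, and ~50 microseconds per
// cudaEvent_t.synchronize() call. Both figures are illustrative guesses.
public class SyncCostEstimate {

    /** Total save time in seconds if every float read pays a fixed sync cost. */
    static double estimatedSaveSeconds(long paramCount, double syncMicrosPerFloat) {
        return paramCount * syncMicrosPerFloat / 1_000_000.0;
    }

    public static void main(String[] args) {
        double secs = estimatedSaveSeconds(30_000_000L, 50.0);
        // 30M params * 50 us each = 1500 s = 25 min, i.e. the right order
        // of magnitude for the slow save I observe.
        System.out.printf("Estimated save time: %.0f s (~%.0f min)%n", secs, secs / 60.0);
    }
}
```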

The equivalent stack trace when using a single GPU and EarlyStoppingTrainer is:

Thread [main] (suspended)
    GridFlowController(SynchronousFlowController).synchronizeToHost(AllocationPoint) line: 97
    GridFlowController.synchronizeToHost(AllocationPoint) line: 50
    CudaZeroHandler.synchronizeThreadDevice(Long, Integer, AllocationPoint) line: 1304
    AtomicAllocator.synchronizeHostData(DataBuffer) line: 370
    CudaFloatDataBuffer(BaseCudaDataBuffer).getFloat(long) line: 1131
    CudaFloatDataBuffer(BaseDataBuffer).write(DataOutputStream) line: 1562
    CudaFloatDataBuffer(BaseCudaDataBuffer).write(DataOutputStream) line: 801
    Nd4j.write(INDArray, DataOutputStream) line: 2464
    ModelSerializer.writeModel(Model, OutputStream, boolean, DataNormalization) line: 156
    ModelSerializer.writeModel(Model, OutputStream, boolean) line: 119
    ModelSerializer.writeModel(Model, String, boolean) line: 106
    LocalFileModelSaver.save(MultiLayerNetwork, String) line: 99
    LocalFileModelSaver.saveBestModel(MultiLayerNetwork, double) line: 77
    LocalFileModelSaver.saveBestModel(Model, double) line: 42
    EarlyStoppingTrainer(BaseEarlyStoppingTrainer).fit() line: 223
    LenetNetworkTrainer.trainNetwork(boolean) line: 228
    ImageClassificationTrainer.run() line: 88
    ImageClassificationTrainer.main(String[]) line: 217

Here, cudaEvent_t.synchronize() is not called for every float that is written.

        network = new MultiLayerNetwork(getConfiguration(
            trainingIterator.getLabels().size(), this.builder.getRandomSeed(),
            this.builder.getImageWidth(), this.builder.getImageHeight(),
            this.builder.getImageChannels(), this.builder.getEpochs(),
            iterators.getLeft().getRight(), this.builder.getBatchSize(),
            this.builder.getLearningRateInitialValue(),
            this.builder.getLearningRateDecayExponent()));
        network.init();

        ParallelWrapper wrapper = new ParallelWrapper.Builder<>(network)
            .prefetchBuffer(8).workers(2).averagingFrequency(3)
            .reportScoreAfterAveraging(true).build();
        LocalFileModelSaver localFileModelSaver = new LocalFileModelSaver(
            this.builder.getTrainingPath().getAbsolutePath());
        DataSetLossCalculator dataSetLossCalculator =
            new DataSetLossCalculator(testingIterator, true);

        for (int i = 0; i < this.builder.getEpochs(); i++) {
          wrapper.fit(trainingIterator);
        }

        //
        // This takes about 25 minutes to save the model
        //
        localFileModelSaver.saveLatestModel(network, 0.0);

I expect the model to be saved to disk in 15 to 30 seconds. Am I missing something?
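For what it's worth, here is a workaround I am considering (an untested sketch, not a verified fix; it assumes dup() pulls the whole parameter buffer to the host in one bulk sync rather than one sync per element, which I have not confirmed):

```java
// Untested sketch: force a single bulk device-to-host copy of the
// parameters before serializing, on the assumption that dup() synchronizes
// the whole buffer at once instead of once per float via getFloat(long).
// "latest_model.zip" is just an illustrative file name.
INDArray hostParams = network.params().dup();
network.setParameters(hostParams);
ModelSerializer.writeModel(network,
    new File(this.builder.getTrainingPath(), "latest_model.zip"), true);
```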

0 Answers