H2O Deep Water (GPU-enabled) on an MS Azure N-Series instance fails to initialize the backend

Date: 2017-08-16 15:59:21

Tags: h2o

Update 1:

Log files from the H2O Deep Water cloud: https://drive.google.com/file/d/0B_1g718qYsqhcUl4WFQ5S1NKbE0/view?usp=sharing

  • mxnet backend - now resolved (after stopping/starting the VM in Azure)
  • tensorflow backend - still failing

I want to test H2O Deep Water on MS Azure using a GPU-enabled cloud instance (NC6 - https://azure.microsoft.com/en-us/blog/azure-n-series-general-availability-on-december-1/). But when I run H2O Deep Water, I get the following errors:

  • mxnet backend: java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
  • tensorflow backend: java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null

The configuration and setup are as follows:

After provisioning a DSVM on an NC6 VM, I checked the Deep Water prerequisites, CUDA and cuDNN:

sysadmin@DEVSMTTSYGPU002:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
sysadmin@DEVSMTTSYGPU002:~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
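
As a cross-check, the same version numbers can be read programmatically instead of by eye. The following is a minimal sketch, assuming the header uses the three `#define` lines shown above; the function name `parse_cudnn_version` is hypothetical and not part of H2O or CUDA.

```python
import re

def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR      5".
        match = re.search(r"^\s*#define\s+" + name + r"\s+(\d+)",
                          header_text, re.MULTILINE)
        if match is None:
            raise ValueError("missing " + name)
        parts.append(int(match.group(1)))
    return tuple(parts)

# The three lines printed by the grep command in the question.
sample = """#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
"""
print(parse_cudnn_version(sample))  # prints (5, 1, 10)
```

To run it against the real header, read `/usr/local/cuda/include/cudnn.h` (the path used above) and pass its contents to the function.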

After that I went through the following steps:

Set the environment variables:

  • export CUDA_PATH=/usr/local/cuda
  • export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
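
One thing worth verifying at this step is that the directories on `LD_LIBRARY_PATH` really contain the CUDA runtime and cuDNN shared objects, and that the variables are set in the same shell that later starts H2O. The sketch below is only an illustration under those assumptions: the helper `find_shared_libs` is hypothetical, and it checks that files exist, not that the JVM can actually load them.

```python
import os

def find_shared_libs(prefixes, search_path):
    """Map each library prefix to the first matching .so file on search_path."""
    found = {prefix: None for prefix in prefixes}
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for filename in sorted(os.listdir(directory)):
            for prefix in prefixes:
                # Accept e.g. libcudart.so, libcudart.so.8.0, ...
                if found[prefix] is None and filename.startswith(prefix + ".so"):
                    found[prefix] = os.path.join(directory, filename)
    return found

if __name__ == "__main__":
    result = find_shared_libs(["libcudart", "libcudnn"],
                              os.environ.get("LD_LIBRARY_PATH", ""))
    for prefix, path in sorted(result.items()):
        print("{0} -> {1}".format(prefix, path or "NOT FOUND"))
```

If either library is reported as not found, the exports were probably made in a different shell or point to a directory other than the one holding the CUDA libraries.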

Install pip for Python 2.7:

  • sudo apt-get install python-pip

Install Deep Water:

  • pip2 install http://s3.amazonaws.com/h2o-deepwater/public/nightly/latest/h2o-3.13.0-py2.py3-none-any.whl

Install libatlas-base-dev:

  • sudo apt-get install libatlas-base-dev

To run the example, I start Python 2.7 and run:

import h2o
h2o.init()

After that I use H2O Flow to create some artificial data and train a simple Deep Water model:

  • createFrame {"dest":"MNIST_SIM_60k","rows":"60000","cols":"784","seed":7595850248774472000,"seed_for_column_types":-1,"randomize":true,"value":0,"real_range":100,"categorical_fraction":"0","factors":5,"integer_fraction":"1","binary_fraction":"0","binary_ones_fraction":"0","time_fraction":0,"string_fraction":0,"integer_range":"127","missing_fraction":"0","response_factors":2,"has_response":true}
  • buildModel 'deepwater', {"model_id":"deepwater-782cc564-497c-4c39-a22a-b6904fb04188","training_frame":"MNIST_SIM_60k","nfolds":0,"response_column":"response","ignored_columns":[],"epochs":"100","ignore_const_cols":true,"network":"auto","activation":"Rectifier","hidden":[100],"problem_type":"dataset","checkpoint":"","autoencoder":false,"balance_classes":false,"score_each_iteration":false,"categorical_encoding":"AUTO","train_samples_per_iteration":-2,"standardize":true,"distribution":"AUTO","score_interval":5,"score_training_samples":10000,"score_validation_samples":0,"score_duty_cycle":0.1,"stopping_rounds":5,"stopping_metric":"AUTO","stopping_tolerance":0,"max_runtime_secs":0,"backend":"tensorflow","image_shape":[0,0],"channels":3,"network_definition_file":"","network_parameters_file":"","mean_image_file":"","export_native_parameters_prefix":"","input_dropout_ratio":0,"hidden_dropout_ratios":[],"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.05,"seed":-1,"learning_rate":0.001,"learning_rate_annealing":0.000001,"momentum_start":0.9,"momentum_ramp":10000,"momentum_stable":0.9,"classification_stop":0,"shuffle_training_data":true,"mini_batch_size":32,"clip_gradient":10,"sparse":false,"gpu":true,"device_id":[0],"cache_data":true}

For both backends (mxnet and tensorflow) I get the errors mentioned above. For tensorflow, the stack trace is:

java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null
    at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
    at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
    at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
    at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
    at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
    at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

For mxnet, the stack trace is:

java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
    at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
    at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
    at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
    at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
    at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
    at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

How can I get H2O Deep Water running with at least one backend?

Side note: the GPU-enabled XGBoost from H2O works.

Many thanks,

Robert

1 Answer:

Answer 0 (score: 0)

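I don't think we have tried running on Azure other than with the Docker image. Are you using Ubuntu 16.04? If so, it should work, unless the Azure image differs from stock Ubuntu 16.04. It looks like H2O cannot communicate with the backend. If you can post the full H2O logs, I can try to see what the problem is.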
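
When going through the full logs, it may help to pull out every distinct cause reported after the backend-initialization message. The following is a rough sketch, assuming the log contains lines in the same form as the two exceptions quoted in the question; `backend_failures` is a hypothetical helper, not an H2O function.

```python
import re

# Message text taken from the exceptions quoted in the question.
PATTERN = re.compile(r"Unable to initialize the native Deep Learning backend: (.*)")

def backend_failures(log_text):
    """Return the distinct cause strings that follow the backend-init message."""
    causes = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            cause = match.group(1).strip()
            if cause not in causes:
                causes.append(cause)
    return causes

# The two exception lines from the question, used here as sample input.
sample = (
    "java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null\n"
    "java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: "
    "Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader\n"
)
for cause in backend_failures(sample):
    print(cause)
```

In Java, a "Could not initialize class" message generally means that the class's static initializer already failed on an earlier attempt, so the first failure further up in the log is likely to name the native library that could not be loaded.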

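Otherwise, the easiest way to run it is with the Docker image, which is what I would recommend. Everything is already installed in it. The only things you need to install yourself are docker and nvidia-docker. Instructions: https://github.com/h2oai/deepwater#pre-release-docker-image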