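Following the official distributed model training example (https://github.com/rapidsai/cuml/blob/branch-0.18/notebooks/random_forest_mnmg_demo.ipynb), I am training a random forest model on the Iris dataset on a multi-GPU Dask cluster (one scheduler node and three worker nodes), but the model does not seem to be trained properly. The output is: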
CuML accuracy: 0.36666666666666664
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "cuml/ensemble/randomforestclassifier.pyx", line 334, in cuml.ensemble.randomforestclassifier.RandomForestClassifier.__del__
File "cuml/ensemble/randomforestclassifier.pyx", line 350, in cuml.ensemble.randomforestclassifier.RandomForestClassifier._reset_forest_data
AttributeError: 'NoneType' object has no attribute 'free_treelite_model'
Process finished with exit code 0
My environment was built with the following conda command:
conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults rapids-blazing=0.18 python=3.8 cudatoolkit=10.2
The code I use for the RAPIDS RandomForestClassifier is:
import pandas as pd
import cudf
import cuml
from cuml import train_test_split
from cuml.metrics import accuracy_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
# start dask cluster
c = Client('node0:8786')
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
# Read data
pdf = pd.read_csv('/data/iris.csv', header=0, delimiter=',') # Get complete CSV
cdf = cudf.from_pandas(pdf) # Get cuda dataframe
features = cdf.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
labels = cdf.iloc[:, 4].astype('category').cat.codes.astype('int32') # Get label column
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, shuffle=True)
# Distribute data to worker GPUs
n_partitions = n_workers
X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
X_test_dask = dask_cudf.from_cudf(X_test, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)
# Train the distributed cuML model
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
# Predict and check accuracy
cuml_y_pred = cuml_model.predict(X_test_dask).compute().to_array()
print("CuML accuracy: ", accuracy_score(y_test.to_array(), cuml_y_pred))
Switching to LocalCUDACluster did not change the result.
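For context, an accuracy of 0.3667 is 11 correct out of 30 test rows, which is close to the 1/3 chance level for three balanced classes. One thing that may be worth checking is how the classes end up distributed across the partitions. The sketch below is a plain-Python simulation, not cuML or Dask code. It assumes the Iris rows are stored sorted by class, and it compares partitioning in shuffled order against partitioning after the rows have been put back into their original index order. Whether `dask_cudf.from_cudf` actually re-sorts rows this way in this version is an assumption I have not verified, and `partition_class_counts` is a hypothetical helper introduced only for illustration.

```python
import random
from collections import Counter


def partition_class_counts(labels, n_partitions):
    """Split labels into contiguous chunks and count the classes in each chunk."""
    chunk = -(-len(labels) // n_partitions)  # ceiling division
    return [Counter(labels[i:i + chunk]) for i in range(0, len(labels), chunk)]


# Hypothetical stand-in for Iris: 150 rows sorted by class, 50 rows per class.
labels = [0] * 50 + [1] * 50 + [2] * 50

# Shuffle the row indices and keep 80% for training, like a shuffled split.
rng = random.Random(0)
indices = list(range(len(labels)))
rng.shuffle(indices)
train_idx = indices[:120]

# Case A: partitions follow the shuffled order.
shuffled = partition_class_counts([labels[i] for i in train_idx], 3)

# Case B (assumption): rows are back in original index order before partitioning.
resorted = partition_class_counts([labels[i] for i in sorted(train_idx)], 3)

for name, parts in (("shuffled", shuffled), ("re-sorted", resorted)):
    for p, counts in enumerate(parts):
        print(name, "partition", p, dict(sorted(counts.items())))
```

If the real partitions look like case B, each worker would see mostly one class, which could explain near-chance accuracy. Printing the per-partition label counts of `y_train_dask` on the cluster would confirm or rule this out.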
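Could you point out my mistake and show me the correct code? Also, if I want to evaluate the individual decision trees in the trained random forest model, how can I access those trained trees?
Thanks.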