Python - scikit-learn random forest error about value format

Time: 2017-07-06 15:51:06

Tags: python arrays floating-point random-forest

When I run the command:

clf.fit(train_data, train_label)

I get the following error:

  

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The problem is with the array train_data, which has shape (18000, 20). I have tried the following commands:

clf.fit(np.float32(train_data), train_label)

train_data = np.array([s[0].astype('float32') for s in train_data])

The datasets train_data and train_label are in the file train (a Python pickle file), available at the following link:

https://www.dropbox.com/s/b3017gi18x6x325/train?dl=0

However, I cannot get all the values in the array train_data to be valid for the clf.fit function. Any help?

1 Answer:

Answer 0: (score: 1)

I just found a solution to this error. You need to scale the data:

Code:

from sklearn.ensemble import RandomForestClassifier
import pickle
import numpy as np
from sklearn.preprocessing import scale

with open('train', 'rb') as f: 
    train_data, train_label = pickle.load(f)

# Some diagnostics to check for NaNs. No NaNs were found.
print(np.isnan(train_data))
print(np.where(np.isnan(train_data)))
print(np.nan_to_num(train_data))
print(np.isnan(train_label))
print(np.where(np.isnan(train_label)))

# No NaNs were found, so scale the features instead
train_data = scale(train_data)

clf = RandomForestClassifier()
clf.fit(train_data, train_label)
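
The diagnostics above only look for NaNs, but the error message also names infinities and values too large for float32. One plausible explanation for why scaling helps (not verified against the linked file) is that some entries are finite in float64 yet exceed the float32 range. The sketch below counts all three cases on a small synthetic array; summarize_bad_values, standardize, and demo are hypothetical names introduced for illustration, and standardize is a NumPy-only stand-in for the column-wise standardization that sklearn.preprocessing.scale performs by default.

```python
import numpy as np

def summarize_bad_values(arr):
    """Count NaN, infinite, and float32-overflowing entries (hypothetical helper)."""
    arr = np.asarray(arr, dtype=np.float64)
    f32_max = np.finfo(np.float32).max
    finite = np.isfinite(arr)
    return {
        "nan": int(np.isnan(arr).sum()),
        "inf": int(np.isinf(arr).sum()),
        # Finite in float64, but would overflow when cast to float32
        "too_large_for_float32": int((finite & (np.abs(arr) > f32_max)).sum()),
    }

def standardize(arr):
    """Column-wise zero mean and unit variance (assumes no constant columns)."""
    arr = np.asarray(arr, dtype=np.float64)
    return (arr - arr.mean(axis=0)) / arr.std(axis=0)

# Synthetic stand-in for train_data: one value exceeds the float32 range
demo = np.array([[1.0, 2.0], [3.0, 1e40], [5.0, 6.0]])

report_before = summarize_bad_values(demo)
report_after = summarize_bad_values(standardize(demo))
print(report_before)
print(report_after)
```

Running summarize_bad_values(train_data) on the real array before scaling would show which of the three cases actually applies. If it reports infinities rather than oversized values, scaling will not help and those entries need to be replaced or dropped first.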