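Python and the random forest algorithm

Date: 2014-12-13 13:18:54

Tags: python random-forest

I am trying to use Python's random forest ML (machine learning) algorithm with a *.csv file. This is the content of the *.csv file: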

DateTime;Status;Energy
28-02-2014 19:30:00;True;10,1
28-02-2011 06:15:00;False;15,6;
28-02-2011 06:30:00;False;15,2;
28-02-2011 06:45:00;False;15,6;
......

Which packages or libraries (for a random forest model) do I need for the analysis?

My code:

from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt, savetxt

def main():
    dataset = genfromtxt(open("C:\\Users\\PVanDro\\Desktop\\Ddata\\Building0.csv"), delimiter=';', dtype='f8')[1:]
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)
    savetxt("C:\\Users\\PVanDro\\Desktop\\Ddata\\Building0_0.csv", delimiter=';', fmt='%f')

if __name__=='__main__':
    main()

But I get this error:

  File "C:/Users/PVanDro/Desktop/Folder for test/RandomForestExamples1/MainFile.py", line 17, in main
    rf.fit(train, target)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 224, in fit
    X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 283, in check_arrays
    _assert_all_finite(array)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

How can I fix these errors?

1 answer:

Answer 0 (score: 2)

Here is a great tutorial that covers what you need. Below is some sample code.

from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt, savetxt

def main():
    #create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('Data/train.csv','r'), delimiter=',', dtype='f8')[1:]    
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    test = genfromtxt(open('Data/test.csv','r'), delimiter=',', dtype='f8')[1:]

    #create and train the random forest
    #multi-core CPUs can use: rf = RandomForestClassifier(n_estimators=100, n_jobs=2)
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)

    savetxt('Data/submission2.csv', rf.predict(test), delimiter=',', fmt='%f')

if __name__=="__main__":
    main()
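As for the error itself: judging from the sample rows in the question, a likely cause is that genfromtxt with dtype='f8' cannot convert the date strings, the True/False values, or comma decimals such as 10,1 into floats, so it fills those cells with NaN, which rf.fit then rejects. Below is a minimal sketch that parses such rows with the standard library instead. The function name parse_energy_csv, the choice of Status as the target, and the two features (hour of day and Energy) are assumptions made for illustration, not something the question specifies.

```python
import csv
import io
from datetime import datetime


def parse_energy_csv(lines):
    """Parse 'DateTime;Status;Energy' rows into (datetime, bool, float) tuples."""
    reader = csv.reader(lines, delimiter=';')
    next(reader, None)  # skip the header row
    records = []
    for row in reader:
        # Some rows end with a trailing ';', which yields an empty extra field.
        fields = [f.strip() for f in row if f.strip()]
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        when = datetime.strptime(fields[0], "%d-%m-%Y %H:%M:%S")
        status = fields[1] == "True"
        energy = float(fields[2].replace(",", "."))  # decimal comma -> point
        records.append((when, status, energy))
    return records


# Two rows copied from the question, held in memory instead of a file.
sample = io.StringIO(
    "DateTime;Status;Energy\n"
    "28-02-2014 19:30:00;True;10,1\n"
    "28-02-2011 06:15:00;False;15,6;\n"
)
records = parse_energy_csv(sample)

# Assumed layout: predict Status from the hour of day and the Energy reading.
features = [[when.hour + when.minute / 60.0, energy] for when, _, energy in records]
targets = [int(status) for _, status, _ in records]
```

With a real file, pass the open file object instead of the StringIO. The resulting features and targets lists contain only finite numbers, so they can be handed to rf.fit(features, targets) in place of train and target.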

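Once you have created the new dataset of predictions, there are plenty of libraries you can use to visualize it graphically. Here are a few:

  1. Bokeh - a Python-based visualization library for web-based presentation
  2. D3 - another web-based library for visualizing data, written in JavaScript. Here is an example that you can use with CSV.
  3. Plotly - Python-based visualization
  4. There are many more, but you can search Google for those. ;)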
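Tools that load CSV directly, such as D3's CSV loader, treat the first row as column names, while the savetxt call above writes a bare column of numbers. A small sketch of writing the predictions with a header row is shown here. The function name write_predictions, the column names, and the sample timestamps and values are made up for illustration.

```python
import csv
import io


def write_predictions(out, timestamps, predictions):
    """Write one 'DateTime,Prediction' row per prediction, after a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["DateTime", "Prediction"])
    for ts, pred in zip(timestamps, predictions):
        writer.writerow([ts, pred])


# Illustrative values only; in practice these would come from rf.predict(test).
buffer = io.StringIO()
write_predictions(
    buffer,
    ["2014-02-28T19:30:00", "2014-02-28T19:45:00"],
    [1.0, 0.0],
)
csv_text = buffer.getvalue()
```

To write to disk, pass a file opened with open(path, "w", newline="") instead of the in-memory buffer.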