The goal is to predict the price in this dataset using a random forest.
+---------+--------+--------+-------+------------+
| | weight | color | price | |
+---------+--------+--------+-------+------------+
| 1 | 2 | blue | 20 | = 2 x 10 |
+---------+--------+--------+-------+------------+
| 2 | 2 | red | 60 | = 2 x 30 |
+---------+--------+--------+-------+------------+
| 3 | 3 | blue | 30 | = 3 x 10 |
+---------+--------+--------+-------+------------+
| 4 | 1 | yellow | 5 | = 1 x 5 |
+---------+--------+--------+-------+------------+
| ... | ... | ... | ... | ... |
+---------+--------+--------+-------+------------+
| 1200000 | 4 | blue | 40 | = 4 x 10 |
+---------+--------+--------+-------+------------+
First, convert the strings in the color column to integer values.
+---+--------+-----+
| | color | int |
+---+--------+-----+
| 1 | yellow | 1 |
+---+--------+-----+
| 2 | blue | 2 |
+---+--------+-----+
| 3 | red | 3 |
+---+--------+-----+
| 4 | ... | ... |
+---+--------+-----+
So the dataset should look like this:
+---------+--------+--------+-------+------------+
| | weight | color | price | |
+---------+--------+--------+-------+------------+
| 1 | 2 | 2 | 20 | = 2 x 10 |
+---------+--------+--------+-------+------------+
| 2 | 2 | 3 | 60 | = 2 x 30 |
+---------+--------+--------+-------+------------+
| 3 | 3 | 2 | 30 | = 3 x 10 |
+---------+--------+--------+-------+------------+
| 4 | 1 | 1 | 5 | = 1 x 5 |
+---------+--------+--------+-------+------------+
| ... | ... | ... | ... | ... |
+---------+--------+--------+-------+------------+
| 1200000 | 4 | 2 | 40 | = 4 x 10 |
+---------+--------+--------+-------+------------+
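The encoding step above can be sketched in plain Python. This is only an illustration: the `rows` list is a hypothetical sample made up of the four visible rows, and `color_to_int` just restates the color table (the elided entries are left out).

```python
# Hypothetical sample: the four visible rows as (weight, color, price).
rows = [(2, "blue", 20), (2, "red", 60), (3, "blue", 30), (1, "yellow", 5)]

# Mapping taken from the color table: yellow -> 1, blue -> 2, red -> 3.
color_to_int = {"yellow": 1, "blue": 2, "red": 3}

# Replace each color string with its integer code; weight and price are unchanged.
encoded = [(weight, color_to_int[color], price) for weight, color, price in rows]

for row in encoded:
    print(row)
```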
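Is this the correct way to encode the categories? The random forest then has to predict the price column for the test set. How does the random forest algorithm interpret the fact that some of these integer values are larger than others?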
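import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df is the DataFrame shown above, with color already encoded as integers
labels = np.array(df['price'])              # target: the price column
data = np.array(df.drop('price', axis=1))   # features: weight and color
train_X, test_X, train_y, test_y = train_test_split(data, labels, test_size=0.25, random_state=42)
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_X, train_y)
predictions = rf.predict(test_X)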
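On the question of how the integer codes are interpreted: decision trees in a random forest split a numeric feature with threshold tests of the form `feature <= t`, so the codes are treated as ordered numbers rather than as labels. The sketch below is not scikit-learn's internals. It only lists the candidate thresholds for the three visible color codes and shows which colors end up on each side; `candidate_thresholds` is a hypothetical helper.

```python
# Integer codes from the color table above (elided entries are left out).
color_to_int = {"yellow": 1, "blue": 2, "red": 3}

def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a numeric feature."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

thresholds = candidate_thresholds(color_to_int.values())

# Show how each threshold partitions the colors.
for t in thresholds:
    left = [c for c, code in color_to_int.items() if code <= t]
    right = [c for c, code in color_to_int.items() if code > t]
    print(f"color <= {t}: left={left}, right={right}")
```

With this encoding, no single threshold separates blue (2) from both yellow (1) and red (3), so a tree needs two consecutive splits to isolate it. A sufficiently deep tree can still do that, which is why the arbitrary ordering of the codes is less harmful for tree-based models than for linear ones.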