Question

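I am a beginner in machine learning, and we have a machine learning project on body fitness prediction. After building and evaluating the model, the predictions come out wrong. We use a random forest classifier whose dependent variable is 1 or 0, where 1 means active and 0 means inactive. However, it predicts 0 every time, and even when I change all the input values it still returns 0. What should I do?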

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score  

df = pd.read_csv("Body fitness prediction.csv")

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

df['mood']=le.fit_transform(df['mood'])
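x = df.drop(['date','bool_of_active'],axis=1)  # independent variables
y = df['bool_of_active']  # dependent variable (1 = active, 0 = inactive); not shown in the original post
x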

from sklearn import model_selection, neighbors
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.1,random_state=0)
x_train1=x_train
x_test1=x_test

from sklearn.ensemble import RandomForestClassifier
rmf = RandomForestClassifier(max_depth=12,n_estimators=10,random_state=0)  #max_depth=12
rmf.fit(x_train, y_train)

rmf.score(x_train,y_train)*100
rmf_y_train=rmf.predict(x_train)
rmf_y_train

train=rmf.predict(x_test)
train

import pickle
with open('random.bin','wb') as f:
    pickle.dump(rmf,f)  # the with block closes the file automatically
with open('random.bin','rb') as f_in:
    model = pickle.load(f_in)
# the prediction goes wrong here
rmf_y_new1 = model.predict(scaler.transform(np.array([[25,0,0,5,66]])))
rmf_y_new1
output: array([0])

rmf_y_new1 = model.predict(scaler.transform(np.array([[6401,0,223,5,66]])))
rmf_y_new1

output: array([0])

When I predict with the model directly, I get active (1), but when I predict through the Flask API it returns inactive (0). Can you help me solve this problem?

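import pickle

import numpy as np
from flask import Flask, render_template, request
from sklearn.preprocessing import StandardScaler

app = Flask(__name__)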


@app.route('/')
def home():
    return render_template('index2.html')

@app.route('/predict',methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    if request.method == 'POST':
        stepcount = request.form['stepcount']
        mood = request.form['mood']
        calories_burned = request.form['calories_burned']
        hours_of_sleep = request.form['hours_of_sleep']
        weight = request.form['weight']
        
        

        data = [(float(stepcount),float(mood),float(calories_burned),float(hours_of_sleep),float(weight))]

        scalar = StandardScaler()
        scl_fit = scalar.fit_transform(data)

        with open('random.bin','rb') as f_in:
            model = pickle.load(f_in)

        prediction = model.predict(scalar.transform(np.array(scl_fit)))

    return render_template('index2.html',output="your fitness (1 means active, 0 means inactive): {}".format(prediction))
    


if __name__ == "__main__":
    app.run(debug=True)

Answer 1

You can use GridSearch to tune your model. See below.

# Split the dataset into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(XXX, YYY, test_size = 0.3, random_state = 42)

Then use GridSearch (for details, see https://scikit-learn.org/stable/modules/grid_search.html). You would end up with something like this:

# Set the parameter space in order to find the best hyperparameters for the random forest
parameter_space = {
    'n_estimators': [50, 200, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 9],
    'max_leaf_nodes': [2, 3, 5],           # must be at least 2
    'max_features': ['sqrt', 'log2'],      # 'auto' was removed in recent scikit-learn releases
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 5, 10],       # integer values must be at least 2
}

# Initialise a Random Forest and find the best parameter set by running a grid search with cross validation
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state = 42)

clf_rf = GridSearchCV(rf, parameter_space, n_jobs = -1, cv = 5, verbose = 10)

clf_rf.fit(x_train, y_train)

clf_rf.score(x_val, y_val)
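
For a quick end-to-end check of a search like this, here is a minimal sketch on synthetic data. The dataset from make_classification, the reduced grid, and cv=3 are assumptions chosen only to keep the run short; none of them come from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption: 5 features, binary target)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Deliberately small grid so the search finishes quickly
small_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      small_grid, cv=3)
search.fit(x_train, y_train)

print(search.best_params_)         # best combination found by cross-validation
print(search.best_score_)          # mean cross-validated accuracy of that combination
print(search.score(x_val, y_val))  # accuracy of the refitted best model on held-out data
```

After fitting, best_params_ holds the winning combination, and the search object itself predicts with the best model refitted on the whole training set.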

Of course, you should also check whether the dataset is imbalanced and so on, but that belongs to the preprocessing step.
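
Checking for imbalance can be as simple as counting the labels. A minimal sketch, assuming the target column is named bool_of_active as in the question; the small DataFrame below is made-up illustration data, not the real dataset.

```python
import pandas as pd

# Hypothetical labels for illustration only: 9 inactive rows, 1 active row
df = pd.DataFrame({"bool_of_active": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

counts = df["bool_of_active"].value_counts()
ratios = df["bool_of_active"].value_counts(normalize=True)

print(counts)  # absolute number of rows per class
print(ratios)  # share of each class
```

If one class makes up the vast majority of rows, a model that always predicts that class already scores a high accuracy, so accuracy alone can be misleading.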

Answer 2

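Start with the default parameters of RandomForest. n_estimators=10 is too few. If you do not know what a parameter does, do not change it without first looking up its effect. Also, add the stratify parameter to the split.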

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.1,random_state=0, stratify=y)

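First, n_estimators is the number of trees to use. More trees generally help, with diminishing returns and a higher training cost. In addition, the stratify parameter ensures that the ratio of 0s to 1s is the same in y, y_train, and y_test, so the model sees the same proportion of 1s during training as exists in the full data.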
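
The effect of stratify can be checked directly by comparing the share of 1s in each split. A minimal sketch; the label array and the dummy feature are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)
x = np.arange(100).reshape(-1, 1)  # dummy single feature

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y)

# Share of 1s in the full data, the training set, and the test set
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratify, a small test set such as test_size=0.1 can end up with very few or even no 1s purely by chance.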

Answer 3

There can be several reasons for this.

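The first could be sampling bias. Your training data may contain far more instances labeled 0 than instances labeled 1. As a result, the model becomes biased and classifies every instance as 0 (inactive).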
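
One way to see whether this is happening is to look at the class counts and the confusion matrix. The sketch below also uses class_weight="balanced", which is one possible mitigation and not something the original code used. The synthetic data (about 90% class 0) is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# class_weight="balanced" weights samples inversely to their class frequency
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
train_counts = np.bincount(y_train)
cm = confusion_matrix(y_test, y_pred)

print(train_counts)  # class counts in the training set
print(cm)            # rows: true class, columns: predicted class
```

If the second column of the confusion matrix contains only zeros, the model never predicts class 1, which matches the symptom described in the question.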

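Another reason could be overfitting. The max_depth=12 you set may be too deep, depending on the dimensionality of your dataset. The solution is to prune the trees, that is, to limit their depth to a smaller number to prevent overfitting. Try setting max_depth to a smaller integer, for example max_depth=5, and see whether there is any improvement.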
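
To judge whether depth is the problem, compare training and test accuracy for both settings. A minimal sketch on synthetic data, which is an assumption here because the original CSV is not available; the actual numbers on the real dataset will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5 features, binary target
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for depth in (12, 5):
    clf = RandomForestClassifier(max_depth=depth, random_state=0)
    clf.fit(x_train, y_train)
    # Store (train accuracy, test accuracy) for this depth
    scores[depth] = (clf.score(x_train, y_train), clf.score(x_test, y_test))
    print(depth, scores[depth])
```

A training accuracy far above the test accuracy points to overfitting; if the gap shrinks at the smaller depth without the test accuracy dropping, the shallower setting is the safer choice.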

Machine learning model prediction goes wrong when input is given to the model