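I am a beginner in machine learning, and we have a machine learning project, Bodyfitness prediction. After building and evaluating the model, the predictions come out wrong. We use a random forest classifier whose dependent variable is 1 or 0, where 1 means active and 0 means inactive, but it predicts 0 every time, and even when I change all the input values it still returns 0. What should I do?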
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv("Body fitness prediction.csv")
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['mood']=le.fit_transform(df['mood'])
# Independent variables
x = df.drop(['date', 'bool_of_active'], axis=1)
x
# Dependent variable (assumed to be the 'bool_of_active' column; y is not defined in the posted code)
y = df['bool_of_active']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)
x_train1=x_train
x_test1=x_test
from sklearn.ensemble import RandomForestClassifier
rmf = RandomForestClassifier(max_depth=12,n_estimators=10,random_state=0) #max_depth=12
rmf.fit(x_train, y_train)
rmf.score(x_train,y_train)*100
rmf_y_train=rmf.predict(x_train)
rmf_y_train
train=rmf.predict(x_test)
train
import pickle
# Save the trained model; the with block closes the file on exit
with open('random.bin', 'wb') as f:
    pickle.dump(rmf, f)
# Load the model back
with open('random.bin', 'rb') as f_in:
    model = pickle.load(f_in)
# The predictions go wrong here
rmf_y_new1 = model.predict(scaler.transform(np.array([[25, 0, 0, 5, 66]])))
rmf_y_new1
# output: array([0])
rmf_y_new1 = model.predict(scaler.transform(np.array([[6401, 0, 223, 5, 66]])))
rmf_y_new1
# output: array([0])
When I predict values with the model directly, I get active (1), but when I predict through the Flask API it gives inactive (0). Can you help me solve this?
import pickle
import numpy as np
from flask import Flask, request, render_template
from sklearn.preprocessing import StandardScaler

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index2.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    if request.method == 'POST':
        stepcount = request.form['stepcount']
        mood = request.form['mood']
        calories_burned = request.form['calories_burned']
        hours_of_sleep = request.form['hours_of_sleep']
        weight = request.form['weight']
        data = [(float(stepcount), float(mood), float(calories_burned), float(hours_of_sleep), float(weight))]
        scalar = StandardScaler()
        scl_fit = scalar.fit_transform(data)
        with open('random.bin', 'rb') as f_in:
            model = pickle.load(f_in)
        prediction = model.predict(scalar.transform(np.array(scl_fit)))
        return render_template('index2.html', output="your fitness is 1 means active 0 means :{}".format(prediction))

if __name__ == "__main__":
    app.run(debug=True)
Answer 0 (score: 1)
You can use GridSearch to optimize your model. See below.
# Split the dataset into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(XXX, YYY, test_size = 0.3, random_state = 42)
Then use GridSearch (for details see https://scikit-learn.org/stable/modules/grid_search.html), which gives you this:
from sklearn.model_selection import GridSearchCV
# Set the parameter space in order to find the best hyperparameters for the Random Forest
parameter_space = {
    'n_estimators': [50, 200, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 9],
    'max_leaf_nodes': [2, 3, 5],  # must be at least 2
    'max_features': ['sqrt', 'log2'],  # 'auto' is no longer accepted by recent scikit-learn releases
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 5],  # an integer value must be at least 2
}
# Initialise a Random Forest and find the best parameter set by running a grid search with cross validation
rf = RandomForestClassifier(random_state=42)
clf_rf = GridSearchCV(rf, parameter_space, n_jobs=-1, cv=5, verbose=10)
clf_rf.fit(x_train, y_train)
clf_rf.score(x_val, y_val)
Of course, you should also check whether the dataset is imbalanced and so on, but that belongs to the preprocessing step.
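Once a search like the one above has run, the selected settings and scores can be read off the fitted object. The sketch below is only an illustration: it uses synthetic data from make_classification as a stand-in for the fitness dataset, and a deliberately small grid (small_grid is a made-up name) so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 300 rows, 5 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A deliberately small grid so the example runs in a few seconds.
small_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(x_train, y_train)

# best_params_ holds the winning combination, best_score_ its mean CV accuracy.
print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
# score() uses the estimator refitted on the whole training set.
print("validation accuracy:", round(search.score(x_val, y_val), 3))
```

Comparing the cross-validated score with the held-out validation score gives a rough sense of whether the tuned model generalizes.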
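Answer 1 (score: 0)
Start with the default parameters of RandomForest. n_estimators=10 is too few. If you don't know what a parameter does, don't change it without first looking up what it does. Also, add the stratify parameter to the split.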
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.1,random_state=0, stratify=y)
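n_estimators is the number of trees to use. More trees generally give more stable predictions, at the cost of longer training. The stratify parameter ensures that the ratio of 0/1 is the same in y, y_train, and y_test, so the model sees the same proportion of 1s during training as in the full data.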
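The effect of stratify can be checked on a toy example. The labels below are invented purely for illustration (90 zeros and 10 ones); the point is only to compare the share of 1s before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 inactive (0) and 10 active (1).
y = np.array([0] * 90 + [1] * 10)
x = np.arange(100).reshape(-1, 1)  # dummy single-feature matrix

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

# With stratify=y the share of 1s is preserved in both parts of the split.
print("share of 1s in y:      ", y.mean())
print("share of 1s in y_train:", y_train.mean())
print("share of 1s in y_test: ", y_test.mean())
```

Without stratify, a 10-row test set drawn from such data could easily contain no 1s at all, which makes the test score uninformative about the minority class.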
Answer 2 (score: 0)
This can have several causes.
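The first could be sampling bias. Your training data may contain far more instances labeled 0 than instances labeled 1, so your model becomes biased and classifies every instance as 0 (inactive).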
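A quick way to test this hypothesis is to count the labels. The sketch below uses a made-up label column (95 zeros, 5 ones) in place of the real bool_of_active data. It also shows class_weight="balanced", an existing RandomForestClassifier option that reweights classes inversely to their frequency; whether it helps on the real data is not something this example can show.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labels standing in for df['bool_of_active'].
y = pd.Series([0] * 95 + [1] * 5, name="bool_of_active")

# Relative frequency of each class.
class_share = y.value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the majority class.
# A model scoring close to this value may simply be ignoring the minority class.
majority_share = class_share.max()
print("majority-class baseline accuracy:", majority_share)

# One possible mitigation: weight classes inversely to their frequency.
balanced_rf = RandomForestClassifier(class_weight="balanced", random_state=0)
```

If the baseline is already very high, plain accuracy is a poor metric, and per-class measures such as recall for class 1 are more telling.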
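Another cause could be overfitting. The max_depth=12 you set may be too deep, depending on the dimensionality of the dataset. The remedy is to prune your trees, i.e. limit their depth to a smaller number to prevent overfitting. Try setting max_depth to a smaller integer, such as max_depth=5, and see whether there is any improvement.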
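Rather than judging by the training score, the two depths can be compared with cross-validation. This is a minimal sketch on synthetic data from make_classification, used here as an assumed stand-in for the fitness dataset, so the numbers it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 rows, 5 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

scores = {}
for depth in (5, 12):
    model = RandomForestClassifier(max_depth=depth, n_estimators=100, random_state=0)
    # Mean accuracy over 5 folds; each fold is scored on data the model did not see.
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores[depth]:.3f}")
```

If the shallower forest scores about the same or better on held-out folds, the extra depth is not buying anything.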