Question

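I am trying to predict server load, but the accuracy I get is below 10%. I am using linear regression to predict the data. Is there anything that could help me?

P.S. The CSV file contains dates and times, so I converted both to integers. I am not sure I did that correctly.

Here is my code: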

import pandas as pd
from sklearn import ensemble
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the server metrics and take a quick look at the numeric columns
data = pd.read_csv(".....\\Machine_Learning_Serious\\Server_Prediction\\testing_server.csv")
describe = data.describe()

# Replace the categorical values with integer codes
data_cleanup = {"Timestamp":{'AM': 0, 'PM': 1},
    "Function":{'ccpl_db01': 0, 'ccpl_fin01': 1, 'ccpl_web01': 2},
    "Type": {'% Disk Time': 0, 'CPU Load': 1, 'DiskFree%_C:': 2, 'DiskFree%_D:': 3, 'DiskFree%_E:': 4, 'FreeMemory': 5, 'IIS Current Connections': 6, 'Processor Queue Length': 7, 'SQL_Buffer cache hit ratio': 8, 'SQL_User Connections': 9}}
data.replace(data_cleanup, inplace=True)
final_data = data.head()
#print(final_data)

# 'Data' is the target; every other column is used as a feature
labels = data['Data']
train1 = data.drop(['Data'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.25, random_state=2)
#clf = ensemble.GradientBoostingRegressor(n_estimators= 400 , max_depth = 5,min_samples_split = 2, learning_rate = 0.1,loss='ls')
reg = LinearRegression()
fitting = reg.fit(x_train, y_train)
# For a regressor, score() returns R^2 (coefficient of determination), not a classification accuracy
score = reg.score(x_test, y_test)

The main goal is to predict the load correctly, but right now I am far from that.

Answer 1

Maybe start with some exploratory data analysis to see whether you can find any patterns between the target variable and the features?
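A minimal sketch of what such a first look could involve, using a small synthetic DataFrame as a stand-in for the CSV. The `hour` column and all generated values are assumptions for illustration; only the `Function`, `Type` and `Data` column names come from the question. It summarises the target per metric type and compares the overall correlation with the correlation inside each type.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV; the real columns and values may differ.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Function": rng.choice(["ccpl_db01", "ccpl_fin01", "ccpl_web01"], size=n),
    "Type": rng.choice(["CPU Load", "FreeMemory"], size=n),
    "hour": rng.integers(0, 24, size=n),  # assumed feature, not in the question
})
# Synthetic target: the two metric types live on very different scales.
base = np.where(df["Type"] == "CPU Load", 40.0, 4000.0)
df["Data"] = base + 2.0 * df["hour"] + rng.normal(0, 1, size=n)

# Summary of the target per metric type: shows whether 'Data' mixes scales.
per_type = df.groupby("Type")["Data"].describe()
print(per_type[["count", "mean", "std", "min", "max"]])

# Correlation between a feature and the target, overall and within each type.
overall_corr = df["hour"].corr(df["Data"])
within_corr = {t: g["hour"].corr(g["Data"]) for t, g in df.groupby("Type")}
print("overall:", overall_corr)
print("within each type:", within_corr)
```

In this synthetic case the relationship is strong inside each type but nearly invisible overall, because the type-to-type scale difference dominates. Whether the real data behaves the same way is something only the actual CSV can show.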

It would also be a good idea to extract features from the date/time variables (e.g., weekday_or_not, season) rather than using them as integers.
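As a rough sketch of that idea with pandas: the `Date` and `Time` column names and their formats below are assumptions, since the question does not show the CSV layout. The season index uses a simple northern-hemisphere month grouping, which is also an assumption.

```python
import pandas as pd

# Hypothetical sample; the real column names and formats are not shown in the question.
df = pd.DataFrame({
    "Date": ["2019-01-05", "2019-01-07", "2019-07-10"],
    "Time": ["09:30:00", "14:45:00", "23:15:00"],
})

# Combine date and time into one datetime column instead of raw integers.
ts = pd.to_datetime(df["Date"] + " " + df["Time"], format="%Y-%m-%d %H:%M:%S")

df["hour"] = ts.dt.hour
df["day_of_week"] = ts.dt.dayofweek                    # Monday=0 ... Sunday=6
df["is_weekday"] = (ts.dt.dayofweek < 5).astype(int)   # 1 for Mon-Fri, 0 for Sat/Sun
df["month"] = ts.dt.month
# Coarse season index: 0=Dec-Feb, 1=Mar-May, 2=Jun-Aug, 3=Sep-Nov
df["season"] = (ts.dt.month % 12) // 3

print(df[["hour", "day_of_week", "is_weekday", "month", "season"]])
```

Features like these let a model pick up daily or weekly load patterns that a single ever-increasing integer cannot express.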

You could also try transforming the features (log, sqrt) to see whether the score improves.
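One way to run that comparison, sketched on synthetic data: the feature `x` and the log-shaped target are made up for illustration, so the ranking of the transforms here says nothing about the real server data. The split settings mirror those in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical non-negative feature and a target that depends on its logarithm.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=500)
y = 3.0 * np.log1p(x) + rng.normal(0, 0.1, size=500)

# Candidate versions of the same feature.
candidates = {"raw": x, "sqrt": np.sqrt(x), "log1p": np.log1p(x)}

scores = {}
for name, feat in candidates.items():
    X_train, X_test, y_train, y_test = train_test_split(
        feat.reshape(-1, 1), y, test_size=0.25, random_state=2
    )
    # score() is R^2 on the held-out split
    scores[name] = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

print(scores)
```

`np.log1p` is used rather than `np.log` so that zero values do not produce `-inf`; both transforms require non-negative inputs.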

I would also suggest trying a simple random forest / XGBoost model and checking its performance against the linear regression model.
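A sketch of such a side-by-side check with scikit-learn's `RandomForestRegressor` (XGBoost is a separate package and is left out here). The data is synthetic and hypothetical: a load that jumps during working hours, plus a per-server baseline keyed by an integer code, similar to the integer-coded `Function` column in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: hour of day and an integer-coded server id.
rng = np.random.default_rng(0)
n = 600
hour = rng.integers(0, 24, size=n)
server = rng.integers(0, 3, size=n)

# Load is high during 09:00-17:59 and each server has its own baseline.
working = (hour >= 9) & (hour < 18)
baseline = np.array([0.0, 30.0, -10.0])[server]
y = np.where(working, 80.0, 20.0) + baseline + rng.normal(0, 2, size=n)
X = np.column_stack([hour, server])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
# R^2 on the same held-out split for each model
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
print(scores)
```

On data shaped like this, the linear model scores poorly because neither the working-hours step nor the arbitrary integer codes are linear in the target, while the tree ensemble can split on both. If the real data shows a similar gap, that would suggest the low score comes from the model's assumptions rather than from a lack of signal.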
