I have written code for a linear regression model, but its accuracy is low, and I expected it to be higher.
What should I do to improve the accuracy?
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train_dataset = pd.read_csv("train.csv")
test_dataset = pd.read_csv("test.csv")
train_dataset.isna().sum()
test_dataset.isna().sum()
train_dataset['Age']=train_dataset['Age'].fillna(train_dataset['Age'].mean())
train_dataset['Time_of_service']=train_dataset['Time_of_service'].fillna(train_dataset['Time_of_service'].mean())
train_dataset['Work_Life_balance']=train_dataset['Work_Life_balance'].fillna(train_dataset['Work_Life_balance'].mean())
train_dataset['Pay_Scale']=train_dataset['Pay_Scale'].fillna(train_dataset['Pay_Scale'].mean())
train_dataset['VAR2']=train_dataset['VAR2'].fillna(train_dataset['VAR2'].mean())
train_dataset['VAR4']=train_dataset['VAR4'].fillna(train_dataset['VAR4'].mean())
test_dataset['Age']=test_dataset['Age'].fillna(test_dataset['Age'].mean())
test_dataset['Time_of_service']=test_dataset['Time_of_service'].fillna(test_dataset['Time_of_service'].mean())
test_dataset['Work_Life_balance']=test_dataset['Work_Life_balance'].fillna(test_dataset['Work_Life_balance'].mean())
test_dataset['Pay_Scale']=test_dataset['Pay_Scale'].fillna(test_dataset['Pay_Scale'].mean())
test_dataset['VAR2']=test_dataset['VAR2'].fillna(test_dataset['VAR2'].mean())
test_dataset['VAR4']=test_dataset['VAR4'].fillna(test_dataset['VAR4'].mean())
attributes_to_drop=['Employee_ID','Hometown']
train_dataset=train_dataset.drop(attributes_to_drop,axis=1)
test_dataset=test_dataset.drop(attributes_to_drop,axis=1)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#label encoding
label_encoder = LabelEncoder()
train_dataset.iloc[:,3] = label_encoder.fit_transform(train_dataset.iloc[:,3])
#label_encoder_2 = LabelEncoder()
train_dataset.iloc[:,4] = label_encoder.fit_transform(train_dataset.iloc[:,4])
#label_encoder_3 = LabelEncoder()
train_dataset.iloc[:,5] = label_encoder.fit_transform(train_dataset.iloc[:,5])
#label_encoder_4 = LabelEncoder()
train_dataset.iloc[:,12] = label_encoder.fit_transform(train_dataset.iloc[:,12])
#label_encoder_5 = LabelEncoder()
train_dataset.iloc[:,0] = label_encoder.fit_transform(train_dataset.iloc[:,0])
label_encoder = LabelEncoder()
test_dataset.iloc[:,3] = label_encoder.fit_transform(test_dataset.iloc[:,3])
#label_encoder_2 = LabelEncoder()
test_dataset.iloc[:,4] = label_encoder.fit_transform(test_dataset.iloc[:,4])
#label_encoder_3 = LabelEncoder()
test_dataset.iloc[:,5] = label_encoder.fit_transform(test_dataset.iloc[:,5])
#label_encoder_4 = LabelEncoder()
test_dataset.iloc[:,12] = label_encoder.fit_transform(test_dataset.iloc[:,12])
#label_encoder_5 = LabelEncoder()
test_dataset.iloc[:,0] = label_encoder.fit_transform(test_dataset.iloc[:,0])
x=train_dataset.iloc[:,:-1]
y=train_dataset.iloc[:,-1]
from sklearn.linear_model import LinearRegression
sim_lin_reg = LinearRegression()
sim_lin_reg.fit(x,y)
y_bpred = sim_lin_reg.predict(test_dataset)
print(y_bpred)
sim_lin_reg.score(x,y)#accuracy of model
Observed accuracy: 0.00546752698619779
Expected accuracy: 0.75 or higher
How can we improve the accuracy?
Answer 0 (score: 0)
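This is because you, sir, are making the most horrible mistake one can make!
You should not fit the encoder on the test data. Fit the encoder on the training dataset, then use that same fitted encoder to transform both the training and test sets. It will serve you better to understand how this works. I have written an article on the topic that I hope helps: Handling Categorical Values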
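To see why fitting a separate encoder on the test set is a problem, here is a minimal sketch with made-up category values (the `train_col` and `test_col` lists are hypothetical and not taken from the dataset in the question). `LabelEncoder` assigns integer codes according to the sorted classes it sees while fitting, so two independent fits can give the same category different integers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only.
train_col = ["F", "M", "M", "F"]
test_col = ["M", "M"]  # this test split happens to contain only "M"

# Problematic: fitting again on the test column builds a new mapping.
refit_codes = LabelEncoder().fit_transform(test_col)

# Consistent: fit once on the training column, then reuse that encoder.
encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_col)
test_codes = encoder.transform(test_col)

print(list(encoder.classes_))  # classes learned from the training column
print(list(train_codes))       # codes the model is trained on
print(list(refit_codes))       # codes from the independent refit
print(list(test_codes))        # codes from the reused encoder
```

In this toy case the refit encodes "M" as 0, while the model was trained with "M" encoded as 1, so the model would be fed features whose meaning has silently changed.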
label_encoder_1 = LabelEncoder()  # first label encoder
train_dataset.iloc[:,3] = label_encoder_1.fit_transform(train_dataset.iloc[:,3])  # fit on and transform the training column
label_encoder_2 = LabelEncoder()
train_dataset.iloc[:,4] = label_encoder_2.fit_transform(train_dataset.iloc[:,4])
test_dataset.iloc[:,3] = label_encoder_1.transform(test_dataset.iloc[:,3])  # transform the matching test column with the encoder that was already fitted
test_dataset.iloc[:,4] = label_encoder_2.transform(test_dataset.iloc[:,4])
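One caveat with reusing a fitted `LabelEncoder`: its `transform` raises a `ValueError` if the test column contains a category that never appeared in the training column. The sketch below swaps in a different tool, scikit-learn's `OrdinalEncoder` (with `handle_unknown="use_encoded_value"`, available from scikit-learn 0.24), which encodes several columns at once and maps unseen categories to a placeholder. The frames and the column names `Gender` and `Unit` are hypothetical, and whether -1 is a sensible placeholder for a linear model is an assumption you would need to check on your own data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data; column names and values are made up for illustration.
train_df = pd.DataFrame({"Gender": ["F", "M", "F"],
                         "Unit": ["IT", "Sales", "IT"]})
test_df = pd.DataFrame({"Gender": ["M", "F"],
                        "Unit": ["Sales", "Legal"]})  # "Legal" is unseen in train

cat_cols = ["Gender", "Unit"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training columns only, then reuse the fitted encoder on test.
train_encoded = encoder.fit_transform(train_df[cat_cols])
test_encoded = encoder.transform(test_df[cat_cols])

print(train_encoded)
print(test_encoded)
```

The same fit-on-train, transform-on-test pattern applies here; the only difference from the answer's code is how unknown categories are handled.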