如何在给定数据上正确构建模型以预测目标参数?

时间:2019-03-15 14:03:15

标签: machine-learning scikit-learn regression probability

我有一些包含不同参数的数据集,data.head()看起来像这样

enter image description here

应用了一些预处理并进行了功能排名-

dataset = pd.read_csv("ML.csv",header = 0)

#Get dataset breif
print(dataset.shape)
print(dataset.isnull().sum())
#print(dataset.head())

#Data Pre-processing
data = dataset.drop('organization_id',1)
data = data.drop('status',1)
data = data.drop('city',1)

#Find median for features having NaN
median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = data['zip'].median(),data['role_id'].median(),data['specialty_id'].median(),data['latitude'].median(),data['longitude'].median() 
data['zip'].fillna(median_zip, inplace=True)
data['role_id'].fillna(median_role_id, inplace=True)
data['specialty_id'].fillna(median_specialty_id, inplace=True)
data['latitude'].fillna(median_latitude, inplace=True)
data['longitude'].fillna(median_longitude, inplace=True)

#Fill YearOFExp with 0
data['years_of_experience'].fillna(0, inplace=True)
target = dataset.location_id

#Perform Recursive Feature Extraction
svm = LinearSVC()
rfe = RFE(svm, 1)
rfe = rfe.fit(data, target) #IT give convergence Warning - Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. 


names = list(data)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))

输出

按得分排序的功能:

[(1, 'location_id'), (2, 'department_id'), (3, 'latitude'), (4, 'specialty_id'), (5, 'longitude'), (6, 'zip'), (7, 'shift_id'), (8, 'user_id'), (9, 'role_id'), (10, 'open_positions'), (11, 'years_of_experience')]

据此我了解到哪些参数更重要。 上述处理是否正确对理解此功能很重要。如何使用以上信息进行更好的模型训练?

当我对训练进行建模时,它具有很高的准确性。为何如此精确?

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)

#Data Pre-processing
data = dataset.drop('location_id',1)
data = data.drop('status',1)
data = data.drop('city',1)

#Find median for features having NaN
median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = data['zip'].median(),data['role_id'].median(),data['specialty_id'].median(),data['latitude'].median(),data['longitude'].median() 
data['zip'].fillna(median_zip, inplace=True)
data['role_id'].fillna(median_role_id, inplace=True)
data['specialty_id'].fillna(median_specialty_id, inplace=True)
data['latitude'].fillna(median_latitude, inplace=True)
data['longitude'].fillna(median_longitude, inplace=True)

#Fill YearOFExp with 0
data['years_of_experience'].fillna(0, inplace=True)

#Start training

labels = dataset.location_id
train1 = data
algo = LinearRegression()
x_train , x_test , y_train , y_test = train_test_split(train1 , labels , test_size = 0.20,random_state =1)

# x_train.to_csv("x_train.csv", sep=',', encoding='utf-8')
# x_test.to_csv("x_test.csv", sep=',', encoding='utf-8')

algo.fit(x_train,y_train)
algo.score(x_test,y_test)

输出

0.981150074104111

from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators = 400, max_depth = 5, min_samples_split = 2,
          learning_rate = 0.1, loss = 'ls')
clf.fit(x_train, y_train)
clf.score(x_test,y_test)

输出-

0.99

我做错什么了吗?为这种情况建立模型的正确方法是什么?

我知道有一些方法可以使每个参数获得Precision,Recall和f1。谁能给我参考链接来执行此操作?

0 个答案:

没有答案