I have a Python machine learning project using the scikit-learn library. I have two separate datasets for training and testing, and I'm trying to do linear regression. I use the code block shown below:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))
My results are:
lr.coef_: [ 2.32088001e+00  2.07441948e-12 -4.73338567e-05  6.79658129e+02]
lr.intercept_: 2166.186033098048
Training set score: 0.63
Test set score: 0.5732999
What would you suggest? How can I improve the accuracy (adding code, parameters, etc.)? My datasets are here: https://yadi.sk/d/JJmhzfj-3QCV4V
Answer 0 (score: 16)
I'll elaborate on @GeorgiKaradjov's answer with some examples. Your question is very broad, and there are multiple ways to get improvements. In the end, having domain knowledge (context) will give you the best possible chance of getting improvements.
1) Normalize your data
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
# fit on the combined train/test values so both share the same scale;
# StandardScaler expects a 2D array, hence the reshape
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
giving
0 0.752395
1 0.008489
2 -0.381637
3 -0.020588
4 0.171446
Name: AFP, dtype: float64
2) Encode categorical features
def feature_engineering(df):
    # one-hot encode each categorical column and drop the original
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df
X_train = feature_engineering(X_train)
X_train.head(5)
giving
AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105 dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1 resource_level_2 resource_level_4
0 0.752395 1 0 0 0 1 0 0 1 0 0
1 0.008489 0 0 1 0 0 1 0 1 0 0
2 -0.381637 0 0 1 0 0 1 0 1 0 0
3 -0.020588 0 0 1 0 1 0 0 1 0 0
3) Feature engineering; collinearity
import seaborn as sns

corr = X_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
You want the red line of y=x, since values should be correlated with themselves. However, any other red or blue columns show strong correlation/anti-correlation and warrant more investigation; for example, Resource_Level=1 and Resource_Level=4 may be highly correlated in the sense that if people have 1, the chance of having 4 is reduced, etc. Regression assumes that the parameters used are independent of each other.
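If the heatmap flags suspicious pairs, a short sketch like the one below lists them so you can decide what to drop (this reuses the X_train from above; the 0.8 threshold is an arbitrary choice of mine):

corr = X_train.corr().abs()
# keep only the upper triangle so each feature pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()  # MultiIndex (feature_a, feature_b) -> |correlation|
print(high_pairs[high_pairs > 0.8])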
4) Feature engineering; higher-order terms
Maybe your model is too simple; you could consider adding higher-order and cross terms:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
# build readable names like 'AFP^1xResource_Level^1' from the exponent matrix
target_feature_names = ['x'.join('{}^{}'.format(col, power) for col, power in zip(df.columns, p) if power != 0)
                        for p in poly.powers_]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
I had a quick go at this, and I don't think the higher-order terms help much. It's also possible your data is non-linear; a quick check with a logarithm of the Y output suggests it is in fact linear. You could also look at the actuals, but I was too lazy...
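If you want to run the logarithm check yourself, a minimal sketch would be (reusing X_train/Y_train/X_test/Y_test from the question; np.log1p is my choice here to cope with possible zero Effort values):

lr_log = LinearRegression().fit(X_train, np.log1p(Y_train))  # fit on log(1 + Effort)
print("log-target training score: {:.2f}".format(lr_log.score(X_train, np.log1p(Y_train))))
print("log-target test score: {:.2f}".format(lr_log.score(X_test, np.log1p(Y_test))))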
5) Regularization
Try sklearn's ridge regressor and play with alpha:
lr = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
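If you go this route, fit it and check which alpha the cross-validation picked; a quick sketch, assuming the X_train/Y_train from above:

from sklearn.linear_model import RidgeCV

lr = RidgeCV(alphas=np.arange(70, 100, 0.1), fit_intercept=True)
lr.fit(X_train, Y_train)
print("selected alpha: {:.1f}".format(lr.alpha_))  # alpha_ is set after fitting
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))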
6) Alternative models
Sometimes linear regression just isn't suited. For example, a random forest regressor can perform very well, is usually insensitive to data being standardized, and handles categorical/continuous features. Other models include XGBoost and Lasso (linear regression with L1 regularization).
lr = RandomForestRegressor(n_estimators=100)
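For completeness, a quick Lasso sketch for comparison (the alpha value is illustrative, not tuned):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0).fit(X_train, Y_train)  # L1-regularized linear regression
print("Lasso test score: {:.2f}".format(lasso.score(X_test, Y_test)))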
Putting it all together
I got carried away and started looking into your problem, but couldn't improve things much without knowing all the context behind the features:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
def feature_engineering(df):
    # one-hot encode each categorical column and drop the original
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']
std = StandardScaler()
# StandardScaler expects a 2D array, hence the reshape
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)
lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))
y_pred = lr.predict(X_test)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(Y_test, y_pred, fmt='o')
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])  # y=x reference line
plt.show()
resulting in:
Training set score: 0.90
Test set score: 0.61
You can look at the importance of the variables (higher value = more important):
Importance
AFP 0.882295
dev_plat_077070 0.020817
dev_plat_077082 0.001162
dev_plat_077117108116105 0.016334
dev_plat_080067 0.004077
lang_type_051071076 0.012458
lang_type_052071076 0.021195
lang_type_065112071 0.001118
resource_level_1 0.012644
resource_level_2 0.006673
resource_level_4 0.021227
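For reference, a table like the one above can be produced from the fitted model with something like this (assuming the lr and X_train from the script above):

importances = pd.DataFrame(lr.feature_importances_, index=X_train.columns, columns=['Importance'])
print(importances.sort_values('Importance', ascending=False))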
You can then start looking at hyperparameters to get further improvements: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
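As a starting point, a minimal grid-search sketch over a few RandomForestRegressor hyperparameters might look like this (the grid values are illustrative only; GridSearchCV is already imported in the script above):

param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [None, 5, 10],
              'min_samples_leaf': [1, 2, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5, scoring='r2')
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)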
Answer 1 (score: 1)
Here are some tips:
Data preparation (exploration) is one of the most important steps in a machine learning project, and you need to start with it.
Have you cleaned your data? If not, start with that step! As this tutorial says:
There are no shortcuts for data exploration. If you are in a state of mind that machine learning can sail you away from every data storm, trust me, it won't. After some point of time, you'll realize that you are struggling at improving your model's accuracy. In such a situation, data exploration techniques will come to your rescue.
Here are some steps of data exploration:
missing value treatment,
outlier removal,
feature engineering.
After that, try univariate and bivariate analysis of your features (a sketch of the first two steps follows below).
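A minimal sketch of the first two steps (column names come from the question; dropna and the 1.5×IQR rule are just the simplest conventions, not the only options):

df = pd.read_csv("TrainingData.csv")
df = df.dropna()  # simplest form of missing value treatment
q1, q3 = df['Effort'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['Effort'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]  # drop outliers by the IQR rule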
Use one-hot encoding to convert your categorical features into numeric features.
That's what you need.
Here is a tutorial on how to deal with categorical variables; one-hot encoding from sklearn is the best technique for your problem.
Using an ASCII representation is not the best practice for handling categorical features.
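In other words, instead of the ASCII-code trick from the question, one-hot encode the categorical columns directly; a pandas sketch (column names from the question):

df = pd.read_csv("TrainingData.csv")
df = pd.get_dummies(df, columns=['Development_platform', 'Language_Type', 'Resource_Level'])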
You can find more about data exploration here. Follow the suggestions I gave you, and thank me later.
Answer 2 (score: 0)