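I am currently trying to predict a margin (a percentage) from a data frame with 48,098 rows and 70 features (of all types). I then applied get_dummies so that the data contains only numeric values, which gives the following shape: (48098, 572).
However, the exploration step shows that the target does not really follow a normal distribution (as shown in the figure).
As a result, the scores (R²) on the training and test sets are 0.76 and 0.74, respectively.
I tried a few solutions, for example:
So I tried to implement polynomial regression. The problem appears when PolynomialFeatures is applied to the training set. The following error is raised: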
MemoryError Traceback (most recent call last)
<ipython-input-19-bf8dbdba1272> in <module>
3 poly = PolynomialFeatures(degree=2, include_bias=False)
4 poly = poly.fit(X_train)
----> 5 X_poly = poly.transform(X_train)
~\AppData\Local\Continuum\anaconda3\Anaconda\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X)
1504 XP = sparse.hstack(columns, dtype=X.dtype).tocsc()
1505 else:
-> 1506 XP = np.empty((n_samples, self.n_output_features_), dtype=X.dtype)
1507 for i, comb in enumerate(combinations):
1508 XP[:, i] = X[:, comb].prod(1)
MemoryError:
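A rough back-of-the-envelope check suggests why the dense transform runs out of memory. The sketch below rests on assumptions: 571 input columns (the 572 one-hot columns minus MARGIN), about 36,073 training rows (what a 0.25 test split of 48,098 rows would leave), and 8-byte float64 values. The helper name n_poly_features is hypothetical.

```python
from math import comb


def n_poly_features(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Count the columns produced by a full polynomial expansion up to `degree`."""
    total = comb(n_features + degree, degree)  # this count includes the constant term
    return total if include_bias else total - 1


n_rows = 36_073   # assumed: training rows after a 25% test split of 48,098
n_features = 571  # assumed: 572 one-hot columns minus the MARGIN target

n_out = n_poly_features(n_features, degree=2)
bytes_needed = n_rows * n_out * 8  # float64 takes 8 bytes per value

print(f"output columns: {n_out:,}")
print(f"dense array size: {bytes_needed / 1e9:.1f} GB")
```

Under these assumptions the degree-2 expansion alone has 163,877 columns and would need roughly 47 GB as a dense array, which would be consistent with np.empty failing.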
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-e379555bc257> in <module>
5 # the default "include_bias=True" adds a feature that's constantly 1
6 poly = PolynomialFeatures(degree=2, include_bias=False)
----> 7 poly = poly.fit(X)
8 X_poly = poly.transform(X)
1458 self : instance
1459 """
-> 1460 n_samples, n_features = check_array(X, accept_sparse=True).shape
1461 combinations = self._combinations(n_features, self.degree,
1462 self.interaction_only,
565 # make sure we actually converted to numeric:
566 if dtype_numeric and array.dtype.kind == "O":
--> 567 array = array.astype(np.float64)
568 if not allow_nd and array.ndim >= 3:
569 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: could not convert string to float: 'Loans'
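The 'Loans' in this message is a string value, which suggests that the transformer was fit on the raw frame (df) rather than on the one-hot encoded one (df_ohe). Here is a minimal sketch under that assumption, using a made-up toy frame; the column names AMOUNT and PRODUCT are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy frame: PRODUCT is a string column, like the one holding 'Loans'
df = pd.DataFrame({
    "MARGIN": [0.12, 0.30, 0.25, 0.08],
    "AMOUNT": [1000.0, 250.0, 400.0, 5000.0],
    "PRODUCT": ["Loans", "Cards", "Loans", "Deposits"],
})

# One-hot encode the string column so that every feature is numeric
df_ohe = pd.get_dummies(df, dtype=float)

y = df_ohe["MARGIN"]
X = df_ohe.drop("MARGIN", axis=1)

# Fitting on the encoded frame works because no string values are left
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X.shape, X_poly.shape)
```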
First: the polynomial features
from sklearn.preprocessing import PolynomialFeatures
y = df.MARGIN
X = df.drop('MARGIN', axis=1)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly = poly.fit(X)
X_poly = poly.transform(X)
Second: the train/test split
from sklearn.model_selection import train_test_split
y = df_ohe.MARGIN
X = df_ohe.drop('MARGIN', axis=1)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
The expected result would be a completed polynomial regression, with something like the following:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
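Before trying degree=10, it may help to see how quickly the number of generated columns grows with the degree. This is a small sketch on toy data (5 random features, an assumption made only for illustration) that reads the count from the fitted transformer's n_output_features_ attribute.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy data: 10 rows, 5 features (not the real data set)
X_small = np.random.default_rng(0).normal(size=(10, 5))

counts = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X_small)
    counts[degree] = poly.n_output_features_  # columns after the expansion

print(counts)
```

With only 5 inputs the count already goes from 5 to 125 columns between degree 1 and degree 4, so a high degree on several hundred columns is unlikely to fit in memory.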
Would X_train and X_test then contain the polynomial features?
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
lr_pred = lr.predict(X_test)
train_R2_lr = lr.score(X_train, y_train)
test_R2_lr = lr.score(X_test, y_test)
print("Training set score: {:.2f}".format(train_R2_lr))
print("Test set score: {:.2f}".format(test_R2_lr))
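One way to make sure that both X_train and X_test receive the same polynomial expansion is to put PolynomialFeatures and LinearRegression in a single pipeline, so the expansion is learned on the training set and reapplied to the test set. The sketch below uses synthetic stand-in data (an assumption, not the real margin data) to keep it small enough to run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumed): 200 rows, 3 numeric features, quadratic target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The pipeline expands X_train when fitting and X_test when scoring or predicting
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)

train_R2 = model.score(X_train, y_train)
test_R2 = model.score(X_test, y_test)
print("Training set score: {:.2f}".format(train_R2))
print("Test set score: {:.2f}".format(test_R2))
```

Splitting first and fitting the expansion inside the pipeline keeps the test set out of the fitting step, and the raw X_train and X_test stay unchanged.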
If you have any suggestions, please feel free to share them, and thanks to everyone who takes the time.
Have a good day/night!