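I am trying to write a multiple linear regression model from scratch to predict the key factors that influence the number of views a song gets on Facebook. For each song we collect the following information, which gives the variables I am using: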
df.dtypes
clicked int64
listened_5s int64
listened_20s int64
views int64
percentage_listened float64
reactions_total int64
shared_songs int64
comments int64
avg_time_listened int64
song_length int64
likes int64
listened_later int64
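I am using the number of views as my dependent variable and all the other variables in the dataset as independent variables. The model is posted below: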
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# df_x --> new dataframe of independent variables
df_x = df.drop(columns=['views'])
# df_y --> new dataframe of the dependent variable, views
# (df.ix has been removed from pandas; plain column selection does the same job)
df_y = df[['views']]
names = list(df_x)
regr = linear_model.LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)
# Fitting the model to the training dataset
regr.fit(x_train, y_train)
regr.intercept_
print('Coefficients: \n', regr.coef_)
# .values keeps the subtraction in NumPy, so np.mean returns a single float
print("Mean Squared Error(MSE): %.2f"
      % np.mean((regr.predict(x_test) - y_test.values) ** 2))
print('Variance Score: %.2f' % regr.score(x_test, y_test))
regr.coef_[0].tolist()
Output:
regr.intercept_
array([-1173904.20950487])
MSE: 19722838329246.82
Variance Score: 0.99
It looks like something has gone wrong here.
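For context on that number: the MSE is expressed in squared views, so its square root (the RMSE) is easier to compare against the target. The sketch below only reuses the MSE printed above. Whether the resulting error is large depends on the typical size of `views`, which is not shown in this post.

```python
import math

# MSE copied from the output above (unit: views squared)
mse = 19722838329246.82

# RMSE is in the same unit as the target variable, views
rmse = math.sqrt(mse)
print("RMSE: %.2f views" % rmse)
```

Comparing this value with something like `df['views'].mean()` or `df['views'].std()` would show whether the error is large relative to the data.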
Trying the OLS model:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
model = sm.OLS(y_train, x_train)
result = model.fit()
print(result.summary())
Output:
R-squared: 0.992
F-statistic: 6121.
                           coef   std err        t  P>|t|   [95.0% Conf. Int.]
clicked                  0.3333     0.012   28.257  0.000      0.310     0.356
listened_5s             -0.4516     0.115   -3.944  0.000     -0.677    -0.227
listened_20s             1.9015     0.138   13.819  0.000      1.631     2.172
percentage_listened   7693.2520  1.44e+04    0.534  0.594  -2.06e+04   3.6e+04
reactions_total          8.6680     3.561    2.434  0.015      1.672    15.664
shared_songs           -36.6376     3.688   -9.934  0.000    -43.884   -29.392
comments                34.9031     5.921    5.895  0.000     23.270    46.536
avg_time_listened     1.702e+05  4.22e+04    4.032  0.000   8.72e+04  2.53e+05
song_length          -6309.8021  5425.543   -1.163  0.245   -1.7e+04  4349.413
likes                    4.8448     4.194    1.155  0.249     -3.395    13.085
listened_later          -2.3761     0.160  -14.831  0.000     -2.691    -2.061
Omnibus: 233.399          Durbin-Watson: 1.983
Prob(Omnibus): 0.000      Jarque-Bera (JB): 2859.005
Skew: 1.621               Prob(JB): 0.00
Kurtosis: 14.020          Cond. No. 2.73e+07
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.
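Warning [2] points at possible multicollinearity. One common diagnostic is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 - R²). The sketch below is a minimal NumPy version run on synthetic data. The helper name `vif` and the columns `a`, `b`, `c` are made up for illustration and are not the dataset from this post.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: b is nearly a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = vif(X)
print(vifs)  # large values for a and b, close to 1 for c
```

On the real data the same function could be called as `vif(df_x.to_numpy(dtype=float))`. Predictors such as `listened_5s`, `listened_20s` and `clicked` may well be strongly related, but that is a guess until it is measured.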
Looking at this output, it seems that something is seriously wrong.
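I think something went wrong with the training/test split and with creating the two separate dataframes x and y, but I can't figure out what. This problem has to be solved using multiple regression. Is the relationship not linear? Could you help me figure out what went wrong?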
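One difference between the two fits above is worth keeping in mind when comparing them: sklearn's `LinearRegression` fits an intercept by default, while `sm.OLS(y_train, x_train)` fits only the columns it is given (statsmodels provides `sm.add_constant` to add one). The NumPy sketch below uses synthetic data with a known intercept to show how much the slope can shift when the intercept is left out. It illustrates the effect only and is not a diagnosis of this dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(10, 20, size=200)
# Synthetic target with true intercept 5 and true slope 2
y = 5.0 + 2.0 * x + rng.normal(scale=0.1, size=200)

# Fit through the origin: design matrix holds only the x column
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Fit with an intercept: prepend a column of ones
X_const = np.column_stack([np.ones_like(x), x])
with_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("without intercept: slope = %.3f" % slope_only[0])
print("with intercept: intercept = %.3f, slope = %.3f" % tuple(with_const))
```

Without the column of ones, the slope has to absorb the missing intercept, so the coefficients of the two models are not directly comparable.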