How do I implement multiple linear regression in Python?

Asked: 2018-01-15 04:58:04

Tags: python pandas machine-learning statistics regression

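I am trying to write a multiple linear regression model from scratch to identify the key factors that influence the number of views of a song on Facebook. For each song we collect the following information, which are the variables I am using: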

df.dtypes
clicked                      int64
listened_5s                  int64
listened_20s                 int64
views                        int64
percentage_listened          float64
reactions_total              int64
shared_songs                 int64
comments                     int64
avg_time_listened            int64
song_length                  int64
likes                        int64
listened_later               int64

I use views as my dependent variable and all the other variables in the dataset as independent variables. The model is posted below:

    import numpy as np
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split

    # df_x --> new dataframe of independent variables
    df_x = df.drop(columns=['views'])

    # df_y --> new dataframe of dependent variable views
    # (df.ix has been removed from pandas; plain column selection does the same here)
    df_y = df[['views']]

    names = list(df_x)

    regr = linear_model.LinearRegression()
    x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)

    # Fitting the model to the training dataset
    regr.fit(x_train, y_train)
    regr.intercept_
    print('Coefficients: \n', regr.coef_)
    # .values turns y_test into a NumPy array so that np.mean returns a scalar
    print("Mean Squared Error(MSE): %.2f"
          % np.mean((regr.predict(x_test) - y_test.values) ** 2))
    print('Variance Score: %.2f' % regr.score(x_test, y_test))
    regr.coef_[0].tolist()

Output:

 regr.intercept_
 array([-1173904.20950487])
 MSE: 19722838329246.82
 Variance Score: 0.99

It looks like something went wrong.
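One thing worth checking before concluding that the fit is broken: MSE is expressed in squared units of the target, so a huge MSE can sit next to a high R² when `views` itself is large. The square root of the MSE above is roughly 4.4 million views. Below is a minimal sketch of that comparison. It uses synthetic data as a stand-in, because the real scale of `views` is not shown here; the helper `regression_metrics` and all numbers in it are assumptions made for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for targets and predictions of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)      # squared units of the target
    rmse = np.sqrt(mse)                # same units as the target
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


# Synthetic stand-in for `views` (hypothetical, not the dataset from the question)
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 2e8, size=500)             # targets up to 200 million
y_pred = y_true + rng.normal(0, 4.4e6, size=500)   # errors of a few million

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print("MSE:  %.2f" % mse)
print("RMSE: %.2f" % rmse)
print("R^2:  %.4f" % r2)
```

If the RMSE is small relative to the typical value of `views`, the large MSE and the high variance score are consistent with each other rather than a sign of a bug.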

Trying an OLS model:

    import statsmodels.api as sm
    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    # Note: sm.OLS fits no intercept unless a constant column is added to x_train
    model = sm.OLS(y_train, x_train)
    result = model.fit()
    print(result.summary())

Output:

     R-squared:                       0.992
     F-statistic:                     6121.   

                      coef        std err      t      P>|t|      [95.0% Conf. Int.]


clicked                0.3333      0.012     28.257      0.000         0.310     0.356
listened_5s            -0.4516      0.115    -3.944      0.000        -0.677    -0.227
listened_20s           1.9015      0.138     13.819      0.000         1.631     2.172
percentage_listened    7693.2520   1.44e+04   0.534      0.594     -2.06e+04   3.6e+04
reactions_total        8.6680      3.561      2.434      0.015         1.672    15.664
shared_songs         -36.6376      3.688     -9.934      0.000       -43.884   -29.392
comments              34.9031      5.921      5.895      0.000        23.270    46.536
avg_time_listened    1.702e+05   4.22e+04     4.032      0.000      8.72e+04  2.53e+05
song_length         -6309.8021   5425.543    -1.163      0.245      -1.7e+04  4349.413
likes                  4.8448      4.194      1.155      0.249        -3.395    13.085
listened_later        -2.3761      0.160    -14.831      0.000        -2.691    -2.061


Omnibus:                      233.399   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2859.005
Skew:                           1.621   Prob(JB):                         0.00
Kurtosis:                      14.020   Cond. No.                     2.73e+07

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.
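Warning [2] can be followed up by measuring how strongly each predictor is explained by the others. A common diagnostic is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing column j on the remaining columns. The sketch below computes it with NumPy only. The helper name `variance_inflation_factors` and the synthetic columns are assumptions for illustration; they only borrow names from the dataset above and do not reproduce it.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Hypothetical predictors: the second one is almost a rescaled copy of the first
rng = np.random.default_rng(1)
clicked = rng.normal(1000, 200, size=300)
listened_5s = 0.9 * clicked + rng.normal(0, 10, size=300)
comments = rng.normal(50, 10, size=300)

X = np.column_stack([clicked, listened_5s, comments])
vifs = variance_inflation_factors(X)
for name, vif in zip(["clicked", "listened_5s", "comments"], vifs):
    print("%-12s VIF = %.1f" % (name, vif))
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. Funnel-style counts such as `clicked`, `listened_5s` and `listened_20s` are plausible candidates, although only the real data can confirm that. Note also that a large condition number can come purely from predictors on very different scales.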

Looking at this output, something seems to be seriously wrong.
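One difference between the two attempts is worth keeping in mind when comparing them: scikit-learn's `LinearRegression` fits an intercept by default, whereas `sm.OLS` only fits one if a constant column is added to the predictors (for example with `sm.add_constant`). The summary above lists no `const` row, so that model was fitted through the origin. The following from-scratch sketch shows how much the coefficients can shift between the two cases. The helper `fit_ols` and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def fit_ols(X, y, add_intercept=True):
    """Least-squares coefficients; the intercept comes first when it is added."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Synthetic data with a true intercept of 5 and slopes 2 and -3 (hypothetical)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 2))
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

with_const = fit_ols(X, y, add_intercept=True)
no_const = fit_ols(X, y, add_intercept=False)
print("with intercept:   ", np.round(with_const, 2))
print("without intercept:", np.round(no_const, 2))
```

Without the column of ones, the slopes have to absorb the missing intercept, so they move away from the values used to generate the data. The R² that statsmodels reports for a model without a constant is also the uncentered version, which is not directly comparable to the score from scikit-learn.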

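I suspect the problem lies in the train/test split, or in how I create the two separate dataframes x and y, but I can't figure out what it is. I have to solve this problem using multiple regression. Could it be that the relationship is not linear? Can you help me figure out what went wrong?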

0 answers:

No answers yet