I am new to Python and an R user. When I build a simple regression model in R and run what I believe is the same model in iPython, I get very different results: the R-squared, p-values, and significance of the coefficients all disagree. Am I misreading the output, or am I making some other basic mistake?
Here are my code and results:
R code
str(df_nv)
Classes 'tbl_df', 'tbl' and 'data.frame': 81 obs. of 2 variables:
$ Dependent Variable : num 733 627 405 353 434 556 381 558 612 901 ...
$ Independent Variable: num 0.193 0.167 0.169 0.14 0.145 ...
summary(lm(`Dependent Variable` ~ `Independent Variable`, data = df_nv))
Call:
lm(formula = `Dependent Variable` ~ `Independent Variable`, data = df_nv)
Residuals:
Min 1Q Median 3Q Max
-501.18 -139.20 -82.61 -15.82 2136.74
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 478.2 148.2 3.226 0.00183 **
`Independent Variable` -196.1 1076.9 -0.182 0.85601
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 381.5 on 79 degrees of freedom
Multiple R-squared: 0.0004194, Adjusted R-squared: -0.01223
F-statistic: 0.03314 on 1 and 79 DF, p-value: 0.856
iPython notebook code
df_nv.dtypes
Dependent Variable float64
Independent Variable float64
dtype: object
model = sm.OLS(df_nv['Dependent Variable'], df_nv['Independent Variable'])
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: Dependent Variable R-squared: 0.537
Model: OLS Adj. R-squared: 0.531
Method: Least Squares F-statistic: 92.63
Date: Fri, 20 Jan 2017 Prob (F-statistic): 5.23e-15
Time: 09:08:54 Log-Likelihood: -600.40
No. Observations: 81 AIC: 1203.
Df Residuals: 80 BIC: 1205.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Independent Variable 3133.1825 325.537 9.625 0.000 2485.342 3781.023
Omnibus: 89.595 Durbin-Watson: 1.940
Prob(Omnibus): 0.000 Jarque-Bera (JB): 980.289
Skew: 3.489 Prob(JB): 1.36e-213
Kurtosis: 18.549 Cond. No. 1.00
For reference, the head of the data frame in R and in Python:
R:
head(df_nv)
Dependent Variable Independent Variable
<dbl> <dbl>
1 733 0.1932367
2 627 0.1666667
3 405 0.1686183
4 353 0.1398601
5 434 0.1449275
6 556 0.1475410
Python:
df_nv.head()
Dependent Variable Independent Variable
5292 733.0 0.193237
5320 627.0 0.166667
5348 405.0 0.168618
5404 353.0 0.139860
5460 434.0 0.144928
Answer (score: 4)
Here are the results of running a linear regression on the gapminder dataset using R and using Python (pandas with statsmodels.formula.api); they are exactly the same:
df <- read.csv('gapminder.csv')
df <- df[c('internetuserate', 'urbanrate')]
df <- df[complete.cases(df),]
dim(df)
# [1] 190 2
m <- lm(internetuserate~urbanrate, df)
summary(m)
#Call:
#lm(formula = internetuserate ~ urbanrate, data = df)
#Residuals:
# Min 1Q Median 3Q Max
#-51.474 -15.857 -3.954 14.305 74.590
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -4.90375 4.11485 -1.192 0.235
#urbanrate 0.72022 0.06753 10.665 <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 22.03 on 188 degrees of freedom
#Multiple R-squared: 0.3769, Adjusted R-squared: 0.3736
#F-statistic: 113.7 on 1 and 188 DF, p-value: < 2.2e-16
import pandas
import statsmodels.formula.api as smf
data = pandas.read_csv('gapminder.csv')
data = data[['internetuserate', 'urbanrate']]
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data = data.dropna(axis=0, how='any')
print(data.shape)
# (190, 2)
reg1 = smf.ols('internetuserate ~ urbanrate', data=data).fit()
print (reg1.summary())
# OLS Regression Results
#==============================================================================
#Dep. Variable: internetuserate R-squared: 0.377
#Model: OLS Adj. R-squared: 0.374
#Method: Least Squares F-statistic: 113.7
#Date: Fri, 20 Jan 2017 Prob (F-statistic): 4.56e-21
#Time: 23:27:50 Log-Likelihood: -856.14
#No. Observations: 190 AIC: 1716.
#Df Residuals: 188 BIC: 1723.
#Df Model: 1
#Covariance Type: nonrobust
#================================================================================
# coef std err t P>|t| [95.0% Conf. Int.]
# ------------------------------------------------------------------------------
# Intercept -4.9037 4.115 -1.192 0.235 -13.021 3.213
# urbanrate 0.7202 0.068 10.665 0.000 0.587 0.853
#================================================================================
# Omnibus: 10.750 Durbin-Watson: 2.097
# Prob(Omnibus): 0.005 Jarque-Bera (JB): 10.990
# Skew: 0.574 Prob(JB): 0.00411
# Kurtosis: 3.262 Cond. No. 157.
#==============================================================================