Unable to run logit model / logistic regression

Asked: 2018-03-29 13:29:07

Tags: python-3.x

I am trying to run a logistic regression. The data has been cleaned and the categorical variables converted to dummy variables, but when I run the code I get an error from inside the statsmodels package (not from my own code), and in this case I am not sure how to correct it.

A friend of mine ran the same code and got an output (screenshot below). Since I am using Spyder with Python 3.6 and he is on Python 3.5, he thinks it may be a version issue.

My code is below. Any ideas on how to fix this, or on a better way to run the logistic regression, would be appreciated.

The error message I get comes from inside the statsmodels library:

File "C:\Users\sebas\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 2405, in llr_pvalue
    return stats.chisqprob(self.llr, self.df_model)

AttributeError: module 'scipy.stats' has no attribute 'chisqprob'

Thanks!

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split ## sklearn.cross_validation is deprecated/removed in newer scikit-learn
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)


################################################################################

## Logistic regression

###############################################################################


data = pd.read_csv(r"log reg test Lending club 2007-2011 car only.csv")
#data = data.dropna()
print(data.shape)
##print(list(data.columns))

print(data['Distressed'].value_counts()) ## number of defaulted car loans is binary

sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary


##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())

##testing for nulls in dataset
print(data.isnull().sum()) 
scrub_data = data.drop(['mths_since_last_delinq'], axis=1) ## this variable is not statistically significant

print('Here is the logit model data')
print(scrub_data.isnull().sum()) ## removed records of missing info, sample still sufficiently large

print(list(scrub_data.columns))
print(scrub_data.head())

##convert categorical variables to dummies completed in csv file

X = scrub_data.iloc[:, 1:23].values ## columns 1-22 are the features (.ix is deprecated; use .iloc)
y = scrub_data.iloc[:, 0].values    ## column 0 is the Distressed target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=25) 

LogReg=LogisticRegression()
LogReg.fit(X_train,y_train)

y_pred=LogReg.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

print('alternative method using RFE')

#y=['Distressed']
#x=[i for i in data if i not in y]
#print(y)
#print(x)
#print(data.info())

## check for independance between features

correlation = sns.heatmap(data.corr()) ## heatmap showing correlations of the variables
plt.show()

from sklearn.svm import LinearSVC
#logreg = LogisticRegression()
#rfe = RFE(logreg,10)
#rfe=rfe.fit(x,y)
#print(rfe.support_)
#print(rfe.ranking_)

import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())

1 Answer:

Answer 0 (score: 1)

The error can be fixed by assigning the missing function back into the scipy.stats namespace. scipy.stats.chisqprob was removed in SciPy 1.0, and older statsmodels releases still call it when computing the LLR p-value; stats.chi2.sf is the equivalent survival function:

from scipy import stats

stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
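As a usage sketch (assuming the rest of the script above is unchanged), apply the patch before fitting the statsmodels model so that result.summary() can compute the LLR p-value; alternatively, upgrading to a statsmodels release that no longer calls chisqprob should also resolve it:

from scipy import stats
import statsmodels.api as sm

## older statsmodels builds still call stats.chisqprob, which was removed in SciPy 1.0;
## chi2.sf(chisq, df) is the equivalent survival function, so patch it back in
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

logit_model = sm.Logit(y, X)  ## y, X as defined earlier in the question's script
result = logit_model.fit()
print(result.summary())       ## no longer raises the AttributeError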