I am trying to run a logistic regression. The data has already been cleaned and the categorical variables converted to dummy variables, but when I run the code I get an error from the statsmodels package (i.e. from outside my own code) and I'm not sure how to correct it in this situation.
A friend of mine ran the same code and got an output (screenshot below). Since I am using Spyder with Python 3.6 and he is on Python 3.5, he thinks it may be a version issue.
My code is below. Any ideas on how to fix this, or on a better way to run the logistic regression, are appreciated.
The error message I receive comes from inside the statsmodels library:
File "C:\Users\sebas\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 2405, in llr_pvalue
    return stats.chisqprob(self.llr, self.df_model)
AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
Thanks!
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split  ## sklearn.cross_validation has been removed in newer scikit-learn releases
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
################################################################################
## Logistic regression
###############################################################################
data = pd.read_csv(r"log reg test Lending club 2007-2011 car only.csv")
#data = data.dropna()
print(data.shape)
##print(list(data.columns))
print(data['Distressed'].value_counts()) ## number of defaulted car loans is binary
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show()  ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())
##testing for nulls in dataset
print(data.isnull().sum())
scrub_data = data.drop(['mths_since_last_delinq'], axis=1)  ## this variable is not statistically significant
print('Here is the logit model data')
print(scrub_data.isnull().sum()) ## removed records of missing info, sample still sufficiently large
print(list(scrub_data.columns))
print(scrub_data.head())
##convert categorical variables to dummies completed in csv file
X = scrub_data.iloc[:, 1:23].values  ## feature columns 1-22 (.ix has been removed from newer pandas)
y = scrub_data.iloc[:, 0].values     ## column 0 is the Distressed label
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=25)
LogReg=LogisticRegression()
LogReg.fit(X_train,y_train)
y_pred=LogReg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
print('alternative method using RFE')
#y=['Distressed']
#x=[i for i in data if i not in y]
#print(y)
#print(x)
#print(data.info())
## check for independence between features
correlation=sns.heatmap(data.corr()) ## heatmap showing correlations of the variables
print(correlation)
from sklearn.svm import LinearSVC
#logreg = LogisticRegression()
#rfe = RFE(logreg,10)
#rfe=rfe.fit(x,y)
#print(rfe.support_)
#print(rfe.ranking_)
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())
Answer 0 (score: 1)
The error can be fixed by assigning the missing function back into the scipy.stats namespace, like this:
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
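As a usage note, here is a minimal sketch of where that patch could go in the script above; the reassignment only needs to happen once, before result.summary() is called (summary() is what internally triggers llr_pvalue). The X and y below are assumed to be the arrays already built in the question's code.
from scipy import stats
import statsmodels.api as sm

## monkey-patch: chi2.sf returns the same upper-tail chi-squared probability
## that the removed stats.chisqprob used to return
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

logit_model = sm.Logit(y, X)  ## y and X as constructed earlier in the question
result = logit_model.fit()
print(result.summary())       ## summary now renders; llr_pvalue can find stats.chisqprob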