Linking sklearn LogisticRegression coefficients to terms in a sparse matrix, and getting statistical significance / C.I.

Date: 2015-02-23 05:50:50

Tags: python scikit-learn logistic-regression coefficients

This is a continuation of a question that started in another thread.

I'm running a logistic regression with sklearn, using code like the following:

from pandas import *
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model

vect = CountVectorizer(binary=True)

a = read_table('text.tsv', sep='\t', index_col=False)

X = vect.fit_transform(a['text'].values)

logreg = linear_model.LogisticRegression(C=1)

d = logreg.fit(X, a['label'])
d.coef_

Now I'd like to link the values in d.coef_ to the unique terms that make up the columns of the sparse matrix X. What's the right way to do this? I can't seem to make it work, even though it seems like X should have a vocabulary attribute. I get:

In [48]: X.vocabulary_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-48-138ab7dd95ed> in <module>()
----> 1 X.vocabulary_

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.pyc in __getattr__(self, attr)
    497             return self.getnnz()
    498         else:
--> 499             raise AttributeError(attr + " not found")
    500 
    501     def transpose(self):

AttributeError: vocabulary_ not found
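(For reference, the vocabulary lives on the vectorizer object, not on the sparse matrix it returns; a quick check with made-up texts, not the original data:)

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(binary=True)
X = vect.fit_transform(["some sample text", "more sample text"])

# The token -> column-index mapping is stored on the vectorizer itself
print(vect.vocabulary_)           # e.g. {'some': 2, 'sample': 1, ...}

# X is a plain scipy sparse matrix and carries no vocabulary
print(hasattr(X, "vocabulary_"))  # False
```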

Going further, if I want to get statistical significance and confidence intervals for these coefficients (along the lines of what you get from R's glm), is that possible? E.g.,

## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = mydata)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.627  -0.866  -0.639   1.149   2.079  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.98998    1.13995   -3.50  0.00047 ***
## gre          0.00226    0.00109    2.07  0.03847 *  
## gpa          0.80404    0.33182    2.42  0.01539 *  
## rank2       -0.67544    0.31649   -2.13  0.03283 *  
## rank3       -1.34020    0.34531   -3.88  0.00010 ***
## rank4       -1.55146    0.41783   -3.71  0.00020 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.5
## 
## Number of Fisher Scoring iterations: 4

1 Answer:

Answer 0 (score: 2)

You can access the feature names via the vectorizer's get_feature_names method.

You can zip them together with the coefficients, e.g.:

zip(vect.get_feature_names(), d.coef_[0])

This returns a list of (token, coefficient) tuples.