Feature importance for scikit-learn logistic regression

Asked: 2018-04-13 09:07:38

Tags: python scikit-learn logistic-regression

I am looking for a way to understand the impact of the features I am using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understand that the .coef_ attribute gives me the information I am after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).
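For context, a minimal sketch (with made-up data) of what .coef_ returns for a binary problem: one weight per input feature column.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy data: 6 samples, 4 features, binary labels (illustrative values only)
X = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0],
              [0, 2, 2, 1],
              [1, 1, 3, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# For binary problems coef_ has shape (1, n_features)
print(clf.coef_.shape)
```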

The first few rows of my matrix:

phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True

The first row is the header, followed by the data (which I convert to integers in my code, using a LabelEncoder from sklearn's preprocessing module).

Now, when I do

print(classifier.coef_)

I get an array containing 12 columns/elements. This confuses me, since my data contains 13 columns (plus a 14th column with the labels, which I separate from the features later in my code). I was wondering whether sklearn expects/assumes the first column to be an id and does not actually use the values of this column? But I could not find any information on this.
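For reference, the LabelEncoder preprocessing mentioned above works like this (a minimal sketch with a few of the string values from my data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the sorted vocabulary and maps each string to its index
encoded = le.fit_transform(['defnp', 'pper', 'defnp', 'ne'])
print(encoded)      # integer codes, one per input string
print(le.classes_)  # the learned vocabulary, sorted alphabetically
```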

Any help is much appreciated!

1 Answer:

Answer 0 (score: 0):

Not sure how to edit my original question in a way that would still make sense for future reference, so I'll post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]


df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))

testrows = []
trainrows = []
splitIndex = len(matrix) // 10  # integer division; plain / yields a float in Python 3
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)
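To relate the coefficients back to feature names, one option is to pair coef_ with the headers and sort by magnitude; a standalone sketch with made-up data and a hypothetical subset of the column names:

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical: three of the feature columns, already label-encoded
headers = ['length_of_span', 'position_in_sentence', 'length_of_coref_chain']
X = np.array([[2, 18, 1], [2, 1, 2], [8, 9, 1], [1, 4, 9], [3, 1, 9], [6, 5, 1]])
y = np.array([1, 1, 0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
# Pair each feature name with its learned coefficient, largest magnitude first
ranking = sorted(zip(headers, clf.coef_[0]), key=lambda t: abs(t[1]), reverse=True)
for name, weight in ranking:
    print(f'{name}: {weight:.3f}')
```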

I think I may have found the source of the error (thanks to @Alexey Trofimov for pointing me in the right direction). My code originally contained:

train_features = traindf.iloc[:,1:len(headers)-1]

This was copied from another script, where I did have ids as the first column of my matrix, which we did not want to take into account there. The len(headers)-1 part, if I understand correctly, is there to leave out the actual label. Testing this in a real scenario, removing the -1 results in a perfect f-score, which makes sense, since the classifier would then just look at the actual label and always predict correctly... So I have now changed this to

train_features = traindf.iloc[:,0:len(headers)-1]

as in the code snippet above, and now get 13 columns (in X_train.shape, and hence in classifier.coef_). I think this solves my issue, but I am still not 100% convinced, so if someone could point out an error in the reasoning above / in my code, I would be glad to hear about it.
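The slicing logic can be checked in isolation; a minimal sketch with hypothetical column names, showing that iloc[:, 0:len(headers)-1] keeps every column except the last one (the label):

```python
import pandas as pd

headers = ['f1', 'f2', 'f3', 'label']  # hypothetical column names
df = pd.DataFrame([[1, 2, 3, 0], [4, 5, 6, 1]], columns=headers)

# All rows, columns 0 .. len(headers)-2 inclusive: the label column is dropped
features = df.iloc[:, 0:len(headers) - 1]
print(features.shape)    # (2, 3)
print(list(features.columns))
```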