I am looking for a way to understand the impact of the features I am using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understand that the .coef_ attribute gets me the information I am after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).
The first few rows of my matrix:
phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True
The first row is the header, followed by the data (which I convert to integers in my code using sklearn's preprocessing LabelEncoder).
Now, when I do a
print(classifier.coef_)
I get an array containing 12 columns/elements. This confuses me, since my data contains 13 feature columns (plus a 14th column holding the labels, which I separate from the features later in my code). I was wondering whether sklearn expects/assumes the first column to be an id and does not actually use the values of this column, but I could not find any information on this.
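A quick sanity check with random data (hypothetical, not my actual matrix) suggests sklearn keeps one coefficient per input column and does not silently drop an id column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 20 samples, 13 feature columns, binary labels.
X = np.random.RandomState(0).rand(20, 13)
y = np.random.RandomState(1).randint(0, 2, 20)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_.shape)  # (1, 13): one coefficient per column, none skipped
```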
Any help is much appreciated!
Answer 0 (score: 0)
Not sure how to edit the original question in a way that would still make sense for future reference, so I will post a minimal example here:
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy
headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]
df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
# fit one LabelEncoder per column and encode every column to integers
df = df.apply(lambda x: d[x.name].fit_transform(x))
testrows = []
trainrows = []
splitIndex = len(matrix) // 10  # integer division; the first 10% of rows become the test set
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)
classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)
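To see which feature gets which weight, the header names can be paired with classifier.coef_[0]. A hypothetical sketch with stand-in names and random data (assuming the same features-then-label column layout):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in: 3 feature columns plus a final label column name.
headers = ['length_of_span', 'position_in_sentence', 'grammatical_role', 'is_topic']
X = np.random.RandomState(0).rand(30, 3)
y = np.random.RandomState(1).randint(0, 2, 30)
clf = LogisticRegression().fit(X, y)

# One (name, weight) pair per feature column; headers[:-1] drops the label name.
for name, weight in zip(headers[:-1], clf.coef_[0]):
    print(name, weight)
```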
I think I may have found the source of the error (thanks to @Alexey Trofimov for pointing me in the right direction). My code originally contained:
train_features = traindf.iloc[:,1:len(headers)-1]
which was copied over from another script, where I did have an id as the first column of my matrix, which we don't want to take into account. The len(headers)-1 then, if I understand correctly, is there so the actual label is not taken into account. Testing this in a real scenario, removing the -1 yields a perfect f-score, which makes sense, since the model would just look at the actual label and always predict correctly... So I have now changed this to
train_features = traindf.iloc[:,0:len(headers)-1]
as in the code snippet above, and now get 13 columns (in X_train.shape, and hence in classifier.coef_). I think this solved my problem, but am still not 100% convinced, so if someone could point out an error in the reasoning above / in my code, I would be happy to hear about it.
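A quick way to double-check the slicing itself (a hypothetical zero-filled frame with the same 13-features-plus-label layout):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the setup: 13 feature columns + 1 label column.
df = pd.DataFrame(np.zeros((5, 14)), columns=['c%d' % i for i in range(14)])
print(df.iloc[:, 1:13].shape)  # (5, 12): drops the first column AND the label
print(df.iloc[:, 0:13].shape)  # (5, 13): drops only the label column
```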