I use this function to plot the best and worst features (coefficients) for each label.
def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    for i in np.split(coef,6):
        top_positive_coefficients = np.argsort(i)[-top_features:]
        top_negative_coefficients = np.argsort(i)[:top_features]
        top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
    plt.bar(np.arange(2 * top_features), i[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()
Applying it to sklearn's LinearSVC:
if (name == "LinearSVC"):
    print(clf.coef_)
    print(clf.intercept_)
    plot_coefficients(clf, cv.get_feature_names())
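The output of the CountVectorizer I use has shape (15258, 26728).
This is a multiclass decision problem with 6 labels. Calling .ravel() returns a flat array of length 6*26728=160368. This means that every index of 26728 or higher is out of bounds for axis 1. Here are the top and bottom indices for one label: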
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Top [39336 35593 29445 29715 36418 28631 28332 40843 34760 35887 48455 27753
33291 54136 36067 33961 34644 38816 36407 35781]
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Bot [39397 40215 34521 39392 34586 32206 36526 42766 48373 31783 35404 30296
33165 29964 50325 53620 34805 32596 34807 40895]
The first entry in the "Top" list has index 39336. That corresponds to vocabulary entry 39336-26728 = 12608. What do I need to change in my code to make this work?
Edit:
X_train = sparse.hstack([training_sentences,entities1train,predictionstraining_entity1,entities2train,predictionstraining_entity2,graphpath_training,graphpathlength_training])
y_train = DFTrain["R"]
X_test = sparse.hstack([testing_sentences,entities1test,predictionstest_entity1,entities2test,predictionstest_entity2,graphpath_testing,graphpathlength_testing])
y_test = DFTest["R"]
Shapes:
(15258, 26728)
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(15258L, 1L)
File "TwoFeat.py", line 708, in plot_coefficients
colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
MemoryError
Answer 0 (score: 1)
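First, do you really have to use ravel()?
LinearSVC (or in fact any other classifier that has coef_) returns coef_ in the following shape:
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features (coefficients in the primal problem).
So it has as many rows as there are classes and as many columns as there are features. For each class, you just need to access the right row. The order of the classes is available in the classifier.classes_ attribute.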
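To see why the indices in the question run past the vocabulary size: ravel() flattens the [n_classes, n_features] matrix row by row (NumPy's default C order), so a flat index encodes both the class row and the feature column. A minimal sketch in plain Python, using the numbers from the question and assuming the printed indices refer to the raveled array:

```python
n_classes, n_features = 6, 26728  # shape of coef_ reported in the question
flat_index = 39336                # first entry of the "Top" list

# Row-major flattening: flat_index = row * n_features + column
class_row, feature_col = divmod(flat_index, n_features)

print(n_classes * n_features)  # 160368, the length of the raveled array
print(class_row, feature_col)  # 1 12608
```

np.unravel_index(flat_index, (n_classes, n_features)) performs the same conversion. Indexing coef_[class_row] directly avoids the need for it.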
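Second, the indentation of your code is wrong. The plotting code should be inside the for loop so that a plot is drawn for each class. Currently it is outside the scope of the for loop, so it only plots the last class.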
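The effect of the misplaced indentation can be seen in a tiny sketch (the labels are made up for illustration): code placed after the loop body runs once and only sees the loop variable's final value.

```python
labels = ["class_0", "class_1", "class_2"]  # hypothetical labels

plotted_inside = []
for label in labels:
    plotted_inside.append(label)  # indented: runs once per class

plotted_after = [label]  # dedented: runs once, with the last class only

print(plotted_inside)  # ['class_0', 'class_1', 'class_2']
print(plotted_after)   # ['class_2']
```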
With these two things corrected, the following function plots the top and bottom features for each class. You can use it however you like:
import numpy as np
import matplotlib.pyplot as plt

def plot_coefficients(classifier, feature_names, top_features=20):
    # Access the coefficients from classifier
    coef = classifier.coef_
    # Access the classes
    classes = classifier.classes_
    # Iterate the loop for number of classes
    for i in range(len(classes)):
        print(classes[i])
        # Access the row containing the coefficients for this class
        class_coef = coef[i]
        # Below this, I have just replaced 'i' in your code with 'class_coef'
        # Pass this to get top and bottom features
        top_positive_coefficients = np.argsort(class_coef)[-top_features:]
        top_negative_coefficients = np.argsort(class_coef)[:top_features]
        # Concatenate the above two
        top_coefficients = np.hstack([top_negative_coefficients,
                                      top_positive_coefficients])
        # create plot
        plt.figure(figsize=(10, 3))
        colors = ["red" if c < 0 else "blue" for c in class_coef[top_coefficients]]
        plt.bar(np.arange(2 * top_features), class_coef[top_coefficients], color=colors)
        feature_names = np.array(feature_names)
        # Here I corrected the start to 0 (Your code has 1, which shifted the labels)
        # One tick per bar, so the number of ticks matches the number of labels
        plt.xticks(np.arange(2 * top_features),
                   feature_names[top_coefficients], rotation=60, ha="right")
        plt.show()
Output of the above code: [plots not included]
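For a quick check without opening figure windows, the same per-class row access can list the feature names as text. This is a minimal sketch and not part of the original answer: `top_features_per_class`, the stand-in classifier built with `SimpleNamespace`, and all numbers in it are hypothetical, and it assumes more than two classes so that coef_ has one row per class.

```python
from types import SimpleNamespace

import numpy as np


def top_features_per_class(classifier, feature_names, top_features=2):
    """Return {class label: (bottom names, top names)} from a 2-D coef_."""
    feature_names = np.asarray(feature_names)
    result = {}
    for label, class_coef in zip(classifier.classes_, classifier.coef_):
        order = np.argsort(class_coef)  # ascending: most negative first
        bottom = feature_names[order[:top_features]].tolist()
        top = feature_names[order[-top_features:]].tolist()
        result[str(label)] = (bottom, top)
    return result


# Stand-in for a fitted multiclass linear model (made-up numbers).
fake_clf = SimpleNamespace(
    classes_=np.array(["a", "b", "c"]),
    coef_=np.array([
        [0.5, -0.2, 0.0, 0.9],
        [-0.7, 0.1, 0.3, -0.1],
        [0.0, 0.4, -0.6, 0.2],
    ]),
)
names = ["w0", "w1", "w2", "w3"]

for label, (bottom, top) in top_features_per_class(fake_clf, names).items():
    print(label, "bottom:", bottom, "top:", top)
```

With a real model, the fitted classifier and the vectorizer's feature names would take the place of `fake_clf` and `names`.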