Question

I am training a RandomForestClassifier in pyspark.ml. When I try to get the trained model's feature importances via its featureImportances attribute, the returned tuple contains no feature indices or importance weights:

(37,[],[])

I would expect something like...

(37,[<feature indices>],[<feature importance weights>])

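...(and certainly not for it to be completely blank). This is odd because it seems to recognize that there are 37 features, yet the other two lists contain nothing. Nothing in the docs seems to address this.

What could be going on here?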

Answer 1

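TLDR: Sparse vectors are usually printed in a particular format. If your sparse vector prints as empty, it probably means that every value in it is zero.

Checking the type of the featureImportances attribute of the RandomForestClassificationModel Transformer shows that it is a SparseVector. In most cases, a printed sparse vector looks like...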

(<size>, <list of non-zero indices>, <list of non-zero values associated with the indices>)

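...(if anyone has a link to documentation confirming that this is how sparse vectors should be interpreted, please let me know, because I can't remember how I know this or where it can be confirmed).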
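To make that reading concrete, here is a minimal pure-Python sketch that expands such a triple into a dense list. The helper name `sparse_to_dense` is made up for illustration and is not part of pyspark; it only assumes the (size, indices, values) interpretation described above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain dense list.

    Hypothetical helper for illustration only; not a pyspark function.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # every position not listed is an implicit zero
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


print(sparse_to_dense(5, [0, 2, 4], [1.0, 3.0, 5.0]))
# [1.0, 0.0, 3.0, 0.0, 5.0]
print(sparse_to_dense(37, [], []) == [0.0] * 37)
# True
```

Under this reading, (37,[],[]) is simply a vector of 37 zeros.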

Here are some examples of how SparseVectors print:

from pyspark.mllib.linalg import SparseVector
import pprint

# A sparse vector of size 5 with no non-zero entries
a = SparseVector(5, {})
print(a)
# (5,[],[])
pprint.pprint(a)
# SparseVector(5, {})
pprint.pprint(a.toArray())
# array([0., 0., 0., 0., 0.])

# A sparse vector of size 5 with non-zero entries at indices 0, 2 and 4
b = SparseVector(5, {0: 1, 2: 3, 4: 5})
print(b)
# (5,[0,2,4],[1.0,3.0,5.0])
pprint.pprint(b)
# SparseVector(5, {0: 1.0, 2: 3.0, 4: 5.0})
pprint.pprint(b.toArray())
# array([1., 0., 3., 0., 5.])

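So if featureImportances comes back as a sparse vector like (<size>, [], []), then (I'm pretty sure) it means every importance is zero: the Estimator did not find any of your features useful. One likely cause, though I have not confirmed it for this case, is that the trees made no splits at all. In other words, the features you/I chose are sadly not very good (at least from the Estimator's POV), and more data analysis is needed.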
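If it helps with that follow-up analysis, here is a small sketch for pairing non-zero importances with feature names and for detecting the all-zero case. The function names and the `f0`...`f4` feature names are hypothetical; the code only assumes you can supply the index and value lists, which as far as I know a SparseVector exposes as its `indices` and `values` attributes.

```python
def rank_importances(feature_names, indices, values):
    """Return (name, weight) pairs for non-zero importances, largest first.

    Hypothetical helper: feature_names must be in the same order as the
    columns that were assembled into the feature vector.
    """
    pairs = [
        (feature_names[int(i)], float(v))
        for i, v in zip(indices, values)
        if v != 0
    ]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


def all_importances_zero(values):
    """True when no feature has a non-zero importance, as in (37,[],[])."""
    return all(v == 0 for v in values)


names = ["f0", "f1", "f2", "f3", "f4"]  # made-up feature names
print(rank_importances(names, [0, 2, 4], [0.2, 0.5, 0.3]))
# [('f2', 0.5), ('f4', 0.3), ('f0', 0.2)]
print(all_importances_zero([]))
# True
```

With a fitted model you would pass something like `model.featureImportances.indices` and `model.featureImportances.values`, assuming the attribute is a SparseVector as described above.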
