Question

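Good evening, everyone,

I am trying to find a way to interpret a random forest in Spark. By interpret, I mean finding out which variables are the most influential for a specific row.

In Python, I used to do it like this: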

from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(rfc, X)

The contributions array has all the information I need, and I can then manipulate it to get the result I want. Is there a way to do this with Spark from Python?
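For context, the kind of decomposition returned above can be illustrated on a single toy tree: the prediction equals the bias (the value at the root node) plus one contribution per feature, where each split along the row's path credits the change in node value to the feature it splits on. This is only a hand-built sketch of the idea, not treeinterpreter's own code; the dict-based tree layout and the name `explain_row` are hypothetical.

```python
# Toy tree as nested dicts (hypothetical layout, for illustration only).
# Every node stores "value" (the mean training target at that node);
# internal nodes also store the split feature index, threshold, and children.
TREE = {
    "value": 0.5, "feature": 0, "threshold": 2.0,
    "left": {"value": 0.2},
    "right": {
        "value": 0.8, "feature": 1, "threshold": 1.0,
        "left": {"value": 0.6},
        "right": {"value": 1.0},
    },
}


def explain_row(tree, row, n_features):
    """Return (prediction, bias, contributions) for one row."""
    bias = tree["value"]
    contributions = [0.0] * n_features
    node = tree
    # Walk from the root to a leaf, crediting each change in node value
    # to the feature used by the split that caused it.
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        contributions[node["feature"]] += child["value"] - node["value"]
        node = child
    return node["value"], bias, contributions


prediction, bias, contributions = explain_row(TREE, [3.0, 0.5], n_features=2)
# bias + sum(contributions) adds back up to the prediction (up to float rounding)
print(prediction, bias, contributions)
```

For a forest, the same quantities would be averaged over all trees, which is why the per-row contributions depend on the path each row takes and cannot be read off a single model-wide vector.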

Answer 1

I guess you are talking about feature importance. When using a Pipeline object, this is how it is done in pyspark.ml:

# model is the fitted PipelineModel; its last stage is the trained random forest
tree = model.stages[-1]
# load feature importance from the model object
print(tree.featureImportances)

# You can also print the trees with nodes: 
print('Trees with Nodes: {}'.format(tree.toDebugString))

Or, when using pyspark.ml without a pipeline:

from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
model = rf.fit(data)
print(model.featureImportances)
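Note that featureImportances is a single vector for the whole model, with one weight per feature rather than one per row. To make it easier to read, the weights can be paired with the column names and sorted. A minimal sketch, assuming you keep a list of feature names in the same order they were assembled into the features vector; the helper `rank_importances` and the sample names and numbers are made up for illustration:

```python
def rank_importances(importances, feature_names):
    """Pair each importance with its feature name, highest first."""
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# With a fitted model this would be called as:
#   rank_importances(model.featureImportances.toArray(), feature_names)
# Illustrative values only, not real model output.
ranked = rank_importances([0.1, 0.7, 0.2], ["age", "income", "tenure"])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```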

Interpreting random forests in PySpark