I am looking at the source code of the Isolation Forest algorithm in scikit-learn. The overall workflow for the model is:
read the data set
split the data into training and test sets
train the model on the training data
fit the model
run the other (test) set of data through the model
predict anomalies
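As a concrete version of the steps above, here is a minimal sketch of the workflow I mean. The toy data, the parameter values, and the variable names are made up for illustration; in practice the first step would read a real data set instead.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 200 inliers around the origin
# plus 10 points scattered far away.
rng = np.random.RandomState(0)
inliers = 0.3 * rng.randn(200, 2)
outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([inliers, outliers])

# Split the data, fit on one part, then score and predict on the other.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
clf = IsolationForest(n_estimators=100, random_state=0)
clf.fit(X_train)

scores = clf.score_samples(X_test)  # lower (more negative) = more abnormal
labels = clf.predict(X_test)        # +1 for inliers, -1 for anomalies

print(scores[:5])
print(labels[:5])
```

The scores returned by score_samples are the negated anomaly scores, so they are always below zero, and the most negative values are the most anomalous points.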
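The source code is available on GitHub. In the source there is a file named iforest.py (path: sklearn > ensemble > iforest.py). In that file there is a function named score_samples, which computes the anomaly score of each data point in the data set. The code looks like this: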
def score_samples(self, X):
    n_samples = X.shape[0]
    n_samples_leaf = numpy.zeros((n_samples, self.n_estimators), order="f")
    depths = numpy.zeros((n_samples, self.n_estimators), order="f")

    if self._max_features == X.shape[1]:
        subsample_features = False
    else:
        subsample_features = True

    for i, (tree, features) in enumerate(zip(self.estimators_,
                                             self.estimators_features_)):
        if subsample_features:
            X_subset = X[:, features]
        else:
            X_subset = X
        leaves_index = tree.apply(X_subset)
        node_indicator = tree.decision_path(X_subset)
        n_samples_leaf[:, i] = tree.tree_.n_node_samples[leaves_index]
        depths[:, i] = numpy.ravel(node_indicator.sum(axis=1))
        depths[:, i] -= 1

    depths += _average_path_length(n_samples_leaf)

    scores = 2 ** (-depths.mean(axis=1) / _average_path_length(
        self.max_samples_))

    # Take the opposite of the scores as bigger is better (here less
    # abnormal)
    return -scores
I cannot understand what is happening inside the for loop. Can someone help me?
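In case it helps anyone answering, here is a small experiment I put together that runs the same calls from the loop on a fitted forest, using only public attributes. It is a sketch under assumptions: the toy data and variable names are mine, and `avg_path_length` is my own reimplementation of what I believe the private `_average_path_length` helper computes, not scikit-learn's code.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def avg_path_length(n):
    """My reading of c(n): the average path length of an unsuccessful
    binary-search-tree lookup over n points (assumed, not library code)."""
    n = np.asarray(n, dtype=float)
    out = np.zeros_like(n)          # c(n) = 0 for n <= 1
    out[n == 2] = 1.0
    big = n > 2
    out[big] = (2.0 * (np.log(n[big] - 1.0) + np.euler_gamma)
                - 2.0 * (n[big] - 1.0) / n[big])
    return out

rng = np.random.RandomState(0)
X = np.vstack([0.3 * rng.randn(100, 2), rng.uniform(-4, 4, size=(5, 2))])
clf = IsolationForest(n_estimators=20, random_state=0).fit(X)

depths = np.zeros((X.shape[0], len(clf.estimators_)))
for i, (tree, features) in enumerate(zip(clf.estimators_,
                                         clf.estimators_features_)):
    X_subset = X[:, features]            # columns this tree was trained on
    leaves = tree.apply(X_subset)        # id of the leaf each sample lands in
    path = tree.decision_path(X_subset)  # sparse (n_samples, n_nodes) 0/1 matrix
    n_nodes_on_path = np.ravel(np.asarray(path.sum(axis=1)))
    leaf_sizes = tree.tree_.n_node_samples[leaves]  # training points in that leaf
    # Edges walked = nodes visited - 1, plus c(leaf size) to account for
    # the part of the tree that was never grown below that leaf.
    depths[:, i] = n_nodes_on_path - 1 + avg_path_length(leaf_sizes)

c_max = avg_path_length([clf.max_samples_])[0]
manual = -(2.0 ** (-depths.mean(axis=1) / c_max))
print(np.allclose(manual, clf.score_samples(X)))
```

If my reading is right, each pass through the loop fills one column of depths with the path length of every sample in one tree, and the final check should agree with what score_samples returns. What I still do not follow is why the loop is written the way it is, so corrections are welcome.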