Explanation of the Isolation Forest algorithm source code

Time: 2019-02-04 06:55:36

Tags: python machine-learning scikit-learn

I am looking at the source code of the Isolation Forest algorithm in scikit-learn. The overall workflow of the model is:

read the data set
split the data into training and test sets
fit the model on the training data
score the test data with the fitted model
predict anomalies
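
The steps above can be sketched with scikit-learn's public API. This is only a minimal illustration, not my actual script: the synthetic data, the 70/30 split, and every variable name are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Hypothetical data: a Gaussian cluster plus a few uniformly scattered points.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(size=(300, 2)),
               rng.uniform(-6, 6, size=(15, 2))])

# Train/test split (no labels are needed for this unsupervised model).
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# Fit the forest on the training part.
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_train)

# Score and label the held-out part.
scores = model.score_samples(X_test)  # lower means more abnormal
labels = model.predict(X_test)        # +1 for inliers, -1 for outliers
print(scores[:5], labels[:5])
```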

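The source code is available on GitHub. It contains a file named iforest.py (path: sklearn > ensemble > iforest.py). That file has a function named score_samples, which computes the anomaly score of every data point in the data set. The code looks like this: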

def score_samples(self, X):

    n_samples = X.shape[0]
    n_samples_leaf = numpy.zeros((n_samples, self.n_estimators), order="f")
    depths = numpy.zeros((n_samples, self.n_estimators), order="f")
    if self._max_features == X.shape[1]:
        subsample_features = False
    else:
        subsample_features = True

    for i, (tree, features) in enumerate(zip(self.estimators_,
                                             self.estimators_features_)):
        if subsample_features:
            X_subset = X[:, features]
        else:
            X_subset = X
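        # Index of the leaf node each sample ends up in for this tree.
        leaves_index = tree.apply(X_subset)
        # Sparse (n_samples, n_nodes) matrix marking the nodes on each path.
        node_indicator = tree.decision_path(X_subset)
        # Number of training samples that reached each of those leaves.
        n_samples_leaf[:, i] = tree.tree_.n_node_samples[leaves_index]
        # Nodes on the path minus one = edges traversed = depth of the leaf.
        depths[:, i] = numpy.ravel(node_indicator.sum(axis=1))
        depths[:, i] -= 1

    # Add the expected extra path length for a leaf holding that many samples.
    depths += _average_path_length(n_samples_leaf)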

    scores = 2 ** (-depths.mean(axis=1) / _average_path_length(
        self.max_samples_))

    # Take the opposite of the scores as bigger is better (here less
    # abnormal)
    return -scores

I cannot understand what is happening inside the for loop. Can someone help me?
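
One rough way to see what a single iteration of the loop computes is to run the same calls on one tree of a fitted model. The sketch below assumes a small synthetic data set and arbitrary hyperparameters (n_estimators=10, max_samples=64). Only apply, decision_path, tree_.n_node_samples, tree_.children_left, estimators_ and estimators_features_ come from scikit-learn; the data and the remaining names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 Gaussian points, the first five shifted far away.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
X[:5] += 6.0

model = IsolationForest(n_estimators=10, max_samples=64,
                        random_state=0).fit(X)

# Take one tree and the feature columns it was trained on,
# which is what one pass through the loop works with.
tree = model.estimators_[0]
features = model.estimators_features_[0]
X_subset = X[:, features]

# Leaf node id reached by each sample.
leaves_index = tree.apply(X_subset)

# Sparse indicator matrix: entry (s, n) is 1 if sample s passes through node n.
node_indicator = tree.decision_path(X_subset)

# Row sum = nodes visited; subtracting 1 gives edges traversed (leaf depth).
depths = np.ravel(node_indicator.sum(axis=1)) - 1

# How many training samples ended up in each of those leaves.
n_samples_leaf = tree.tree_.n_node_samples[leaves_index]

print("depths of the shifted points:", depths[:5])
print("mean depth of the other points:", depths[5:].mean())
```

A point that is isolated after only a few splits gets a small depth. The full function averages these depths over all trees, so shallow leaves end up with the more abnormal scores.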

0 answers:

No answers