Question

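I would like to use the Latent Dirichlet Allocation from sklearn for anomaly detection. I need to obtain the likelihood of a new sample, as formally described in the equation here.

How can I get that?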

Answer 1

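Solution to your problem

You should use the score() method of the model, which returns the (approximate) log likelihood of the documents passed in.

Assuming you have created your documents as described in the paper and trained an LDA model for each host, you should then take the lowest likelihood over all the training documents and use it as a threshold. Untested example code follows: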

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation(... parameters here ...)
lda.fit(X)
threshold = min([lda.score([x]) for x in X])
attacks = [
    i for i, x in enumerate(X_unknown)
    if lda.score([x]) < threshold
]

# attacks now contains the indexes of the anomalies
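
For readers who want something they can run end to end, here is a minimal sketch of the same thresholding idea. The synthetic Poisson count matrices, the choice of n_components=5, and the helper name doc_scores are all assumptions made for illustration; they are not from the paper or the original snippet.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): rows are documents, columns are word counts.
X = rng.poisson(lam=2.0, size=(50, 20))          # a host's training documents
X_unknown = rng.poisson(lam=2.0, size=(10, 20))  # documents to test

# n_components=5 is an arbitrary illustrative value, not a recommendation.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)


def doc_scores(model, docs):
    """Score each document separately (hypothetical helper)."""
    # score() expects a 2D array, so each row is reshaped to (1, n_features).
    return np.array([model.score(row.reshape(1, -1)) for row in docs])


train_scores = doc_scores(lda, X)
threshold = train_scores.min()  # lowest score seen among training documents

unknown_scores = doc_scores(lda, X_unknown)
attacks = np.flatnonzero(unknown_scores < threshold)  # indexes of the anomalies
```

Note that the score is a sum over the words of a document, so longer documents tend to score lower; depending on the data, dividing by document length before thresholding may be worth considering.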

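Exactly what you asked

If you want to use the exact equation from the paper you linked, I would advise against trying to do it in scikit-learn, because the interface of the expectation step is not clear.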

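The parameters θ and φ can be found in lines 112 - 130 of the source code as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the doc_topic_distribution and the sufficient statistics, from which you can try to infer θ and φ with the following, again untested, code:

theta = doc_topic_d / doc_topic_d.sum()
# see the variables exp_doc_topic_d in the source code
# in the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS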
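
If relying on private functions is not an option, a rough alternative is a plug-in estimate built only from the public API. The sketch below is an assumption-laden illustration, not the paper's exact equation and not the variational quantities mentioned above: it takes the normalized output of transform() as the document-topic distribution, normalizes components_ row by row to get a topic-word distribution, and sums count-weighted log word probabilities. The synthetic data and the names topic_word, word_probs and doc_log_lik are made up for the example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(50, 20))  # synthetic word counts (assumption)

# n_components=5 is an arbitrary illustrative value.
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Document-topic distribution; recent scikit-learn versions normalize each row.
theta = lda.transform(X)  # shape (n_docs, n_topics)

# components_ holds unnormalized topic-word weights; normalize each topic row.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Plug-in word probabilities: p(w | d) = sum_k theta[d, k] * topic_word[k, w]
word_probs = theta @ topic_word  # shape (n_docs, n_words)

# Count-weighted sum of log word probabilities, one value per document.
doc_log_lik = (X * np.log(word_probs)).sum(axis=1)
```

This treats the point estimates as if they were the true parameters, so the values will generally differ from what score() reports.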

Suggestion for another library

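If you want more control over the expectation and maximization steps and over the variational parameters, I would suggest you look at LDA++ (disclaimer: I am one of the authors of LDA++).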

sklearn Likelihood from latent dirichlet allocation
