Question

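I ran LDA on a corpus of more than 60,000 entries and got decent results. However, after adding 200 rows to this corpus and rerunning LDA, my topics are completely different. Yet 200 rows is less than 1% of the corpus, so normally the results should not change. I have been looking for information about the sensitivity and stability of LDA models, and I found that they are very sensitive to their parameters... Does anyone know more about this?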

Here is the script:

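import mglearn
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the preprocessed corpus
text = pd.read_csv('pretraitement_janaina_modif.csv', encoding = 'utf-8')
# CountVectorizer needs a flat list of strings, not a list of rows
# (assumes the text is in the first column of the CSV)
text_list = text.iloc[:, 0].astype(str).tolist()

# Build the document-term count matrix
vector = CountVectorizer()
X = vector.fit_transform(text_list)

# Fit a 30-topic LDA model with a fixed seed
lda_model = LatentDirichletAllocation(n_components = 30, learning_method = "batch", max_iter = 25, random_state = 0)
document_topics = lda_model.fit_transform(X)
# For each topic, word indices sorted from most to least important
sorting = np.argsort(lda_model.components_, axis = 1)[:, ::-1]
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out() (scikit-learn >= 1.0)
feature_names = np.array(vector.get_feature_names_out())

# print_topics prints the table itself and returns None, so its result is not printed again
mglearn.tools.print_topics(topics = range(30), feature_names = feature_names, sorting = sorting, topics_per_chunk = 5, n_words = 10)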

The first run gives me the following list of topics:

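Topic 0: not received returned order mail always share refund answer

Topic 1: cancel order wish possible want raise deceived error item

Topic 2: keep current informed delivery order informed product inform face number

Topic 3: defective wooden box gift box side level paint defect box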

But when I add a few rows to the corpus, the topics change:

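Topic 0: large metal large size damaged product chandelier toast support

Topic 1: received ordered item wrong missing order not only return

Topic 2: defective wooden box gift box side level paint defect box

Topic 3: order payment amount expenditure transfer pay bank performance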

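What I don't understand is how the algorithm can be so sensitive and unstable that its output changes this much when only a few rows are added...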
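Topic numbers in LDA are arbitrary, and in the lists above the "wooden box" topic shows up under a different number in each run. So before concluding that everything changed, it may be worth checking which topics were only renumbered. The following is a rough sketch of such a check, not a definitive method. `match_topics` is a hypothetical helper (not part of scikit-learn or mglearn), the matching is a simple best-cosine-similarity lookup, and the small matrices are made-up stand-ins for `lda_model.components_` and `vector.get_feature_names_out()` from two runs.

```python
import numpy as np

def match_topics(components_a, vocab_a, components_b, vocab_b):
    """For each topic of run A, return the index of the most similar
    topic of run B and the cosine similarity of that pair.

    components_*: array of shape (n_topics, n_words), e.g. lda_model.components_
    vocab_*: the word behind each column, e.g. vector.get_feature_names_out()
    """
    # The two runs can have different vocabularies, so compare shared words only
    shared = sorted(set(vocab_a) & set(vocab_b))
    index_a = {word: i for i, word in enumerate(vocab_a)}
    index_b = {word: i for i, word in enumerate(vocab_b)}
    a = np.asarray(components_a, dtype=float)[:, [index_a[w] for w in shared]]
    b = np.asarray(components_b, dtype=float)[:, [index_b[w] for w in shared]]
    # Scale every topic to unit length so the dot product is a cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    similarity = a @ b.T
    best = similarity.argmax(axis=1)
    return best, similarity[np.arange(len(best)), best]

# Toy data (hypothetical): the same two topics, stored in a different order,
# and the second run has one extra word in its vocabulary
vocab_1 = ["box", "defect", "order", "refund"]
run_1 = np.array([[0.1, 0.1, 5.0, 4.0],        # topic 0: order / refund
                  [6.0, 5.0, 0.1, 0.1]])       # topic 1: box / defect
vocab_2 = ["bank", "box", "defect", "order", "refund"]
run_2 = np.array([[0.1, 6.0, 5.0, 0.1, 0.1],   # topic 0: box / defect
                  [0.1, 0.1, 0.1, 5.0, 4.0]])  # topic 1: order / refund

best, score = match_topics(run_1, vocab_1, run_2, vocab_2)
for topic_a, (topic_b, s) in enumerate(zip(best, score)):
    print(f"run 1 topic {topic_a} ~ run 2 topic {topic_b} (cosine {s:.2f})")
```

A similarity close to 1 would suggest a topic survived and only its number changed, while low best-match scores would point to topics that really differ between the two runs. Because each topic is matched independently, two topics of the first run can end up paired with the same topic of the second.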

Sensitivity and stability of LDA models

0 answers: