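How to predict topics for test data with a Gensim topic model

Time: 2019-04-22 05:19:09

Tags: python jupyter-notebook gensim topic-modeling mallet

I have already done topic modeling with Gensim's LdaMallet. How can I take a sample paragraph and use the pre-trained model to get its topic distribution?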

Given a piece of text (a), how do I get its topics from the pre-trained model? Please help.

1 answer:

Answer 0 (score: 0)

You will want to process 'a' the same way you processed the training set:

import gensim
import pandas as pd

# Load the new data set that will be passed through the pre-trained LDA model
data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']].copy()
data_text_new['index'] = data_text_new.index

documents_new = data_text_new

# Run the new data through the same stopword removal and lemmatization as the training data.
# lemmatize_stemming is assumed to be the helper defined when the model was trained.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# Reuse the dictionary the model was trained with; do not build or filter a new one,
# because its word ids would not match the ids the model learned.
# `dictionary` is assumed to be the name of that training dictionary.

# Define the bag-of-words corpus; doc2bow drops words that were not seen during training
bow_corpus_new = [dictionary.doc2bow(doc) for doc in processed_docs_new]
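One thing to watch at this step: the integer ids in a bag-of-words vector only mean something to the model if they come from the vocabulary used at training time. The sketch below is a plain-Python stand-in for that mapping, not Gensim's own implementation; `train_token2id`, `to_bow`, and the sample tokens are all made up for illustration.

```python
from collections import Counter

# Hypothetical vocabulary learned at training time (token -> integer id).
train_token2id = {"topic": 0, "model": 1, "corpus": 2}

def to_bow(tokens, token2id):
    """Count tokens present in token2id and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical preprocessed new document; "python" was never seen in training.
new_doc = ["topic", "model", "model", "python"]

# Mapping through the training vocabulary keeps the ids the model knows
# and silently drops the unseen word.
bow_trained_ids = to_bow(new_doc, train_token2id)

# A vocabulary rebuilt from the new text alone assigns unrelated ids:
# here id 0 now stands for "model" instead of "topic".
fresh_token2id = {tok: i for i, tok in enumerate(sorted(set(new_doc)))}
bow_fresh_ids = to_bow(new_doc, fresh_token2id)

print(bow_trained_ids)
print(bow_fresh_ids)
```

With the second mapping, the model would read the counts as belonging to different words than the ones in the text, so the inferred topics would not be meaningful.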

Then you can pass it through the trained model:

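# ldamallet is the pre-trained model; indexing it with a corpus returns,
# for each document, a list of (topic_id, weight) pairs
a = ldamallet[bow_corpus_new]
b = data_text_new

topic_0 = []
topic_1 = []
topic_2 = []

# Assumes a 3-topic model whose pairs come back ordered by topic id
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}

df = pd.DataFrame(data=d)
# mode='a' appends to YourAllocated.csv if the file already exists
df.to_csv("YourAllocated.csv", index=True, mode='a')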
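If the number of topics is not three, or you only want the single most likely topic per document, the per-document list of (topic_id, weight) pairs can be handled generically. This is a minimal sketch in plain Python under the assumption that the model output has that shape; `dominant_topics`, `topic_columns`, and the sample values are hypothetical and not part of Gensim.

```python
def dominant_topics(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight for each document.

    Documents with no pairs yield None.
    """
    return [max(pairs, key=lambda pair: pair[1], default=None) for pairs in doc_topics]

def topic_columns(doc_topics, num_topics):
    """Build {'topic_k': [weights...]} columns, using 0.0 for topics a document lacks."""
    columns = {f"topic_{k}": [] for k in range(num_topics)}
    for pairs in doc_topics:
        weights = dict(pairs)
        for k in range(num_topics):
            columns[f"topic_{k}"].append(weights.get(k, 0.0))
    return columns

# Made-up output for two documents from a hypothetical 3-topic model.
sample = [
    [(0, 0.2), (1, 0.7), (2, 0.1)],
    [(0, 0.5), (2, 0.5)],
]

best = dominant_topics(sample)
cols = topic_columns(sample, num_topics=3)
print(best)
print(cols)
```

The `cols` dictionary has the same layout as the hand-written `topic_0`, `topic_1`, `topic_2` lists, so it can be merged with the text column and passed to `pd.DataFrame` in the same way.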

I hope this helps :)