Classifying sentences into multiple categories

Time: 2017-11-13 16:06:26

Tags: python-3.x machine-learning scikit-learn nltk text-classification

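I'm a beginner with NLTK and scikit-learn, here to learn. I want to classify a given sentence (or even a paragraph) into one of a set of categories. By categories I don't mean just two classes, such as spam vs. ham or positive vs. negative sentiment; I mean there can be several (more than two) categories to choose from. Please help me pick the simplest algorithm for this problem. Thanks in advance.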

2 answers:

Answer 0 (score: 1)

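If I understand correctly, you are trying to do topic modeling on your dataset. As far as I know, you can use LDA (Latent Dirichlet Allocation), but you have to specify the number of topics yourself; you can run a few tests to find a suitable value. Here is an example of LDA in Python that shows how to inspect a model fitted to a subset of the Reuters news dataset. The input X below is a document-term matrix.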

>>> import numpy as np
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8  # the slice below keeps n_top_words - 1 = 7 words per topic
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic bernardin cardinal bishop wright death
Topic 18: harriman clinton u.s ambassador paris president churchill
Topic 19: city museum art exhibition century million churches
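
The example above starts from a ready-made document-term matrix. For your own sentences, such a matrix could be built roughly like this. This is only a minimal sketch: the sample sentences and the `build_doc_term_matrix` helper are made up for illustration, and the whitespace tokenization is a simplifying assumption. In practice scikit-learn's `CountVectorizer` does the same job more robustly.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # a real pipeline would also handle punctuation and stop words).
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every word a stable column index.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


# Hypothetical sample sentences, not taken from the Reuters data.
docs = [
    "the pope visited the hospital",
    "the film opened in paris",
]
vocab, X = build_doc_term_matrix(docs)
print(vocab)  # one column per distinct word
print(X)      # one row per document
```

Each row of `X` corresponds to one document and each column to one vocabulary word, which is the shape a topic model such as LDA expects as input.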

Answer 1 (score: -1)

Judging by the tags on your post, I see you already know about machine learning, which is a good way to approach this project.

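What you need is a decent amount of sample data: a table with one column of text (sample sentences, paragraphs, and so on) and another column giving the category each text belongs to.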
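
As a rough illustration of what such a table might look like, here is a minimal sketch that reads a two-column CSV into parallel lists of texts and labels. The column names `text` and `category` and all of the rows are invented for this example; a real dataset would live in a file and be far larger.

```python
import csv
import io
from collections import Counter

# Hypothetical labelled data: one text column and one category column.
# StringIO stands in for a file on disk to keep the sketch self-contained.
raw = io.StringIO(
    "text,category\n"
    "the team won the football match,sports\n"
    "the senate passed the new bill,politics\n"
    "the new phone has a fast processor,technology\n"
    "the striker scored two goals,sports\n"
)

rows = list(csv.DictReader(raw))
texts = [row["text"] for row in rows]
labels = [row["category"] for row in rows]

# Check how many examples each category has before training.
label_counts = Counter(labels)
print(label_counts)
```

Looking at the label counts first is a cheap way to spot categories that have too few examples to learn from.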

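You then train a program to find patterns in the sample text. With enough sample data, you can analyze new text and have the program output the category it belongs to.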
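
To make "train" and "analyze" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with only the standard library. Naive Bayes is my choice for illustration, not something this answer prescribes; the `train` and `classify` helpers and the six sample sentences are hypothetical. scikit-learn ships an equivalent model as `MultinomialNB`, which would be the more practical option on real data.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of samples
    vocab = set()
    for text, category in samples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: how common the category is in the training data.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training data with three categories.
samples = [
    ("the team won the football match", "sports"),
    ("the striker scored two goals", "sports"),
    ("the senate passed the new bill", "politics"),
    ("the president signed the law", "politics"),
    ("the new phone has a fast processor", "technology"),
    ("the laptop runs the latest software", "technology"),
]
model = train(samples)
print(classify("the president vetoed the bill", model))
print(classify("a fast laptop with new software", model))
```

Nothing in this approach is specific to two classes: adding another category only means adding labelled examples for it.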

You could use TensorFlow as your machine learning framework.

I suggest you start with a few simpler projects to get a feel for how machine learning works and what works best.