Using semantic word representations (e.g. word2vec) to build a classifier

Time: 2015-07-13 18:32:23

Tags: python classification word2vec

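I want to build a classifier for forum posts that automatically assigns each post to one of several predefined categories (so multiclass classification, not just binary) by using semantic word representations. For this task I want to use word2vec and doc2vec and check whether these models can support fast selection of training data for the classifier. So far I have tried both models and they work like a charm. However, I do not want to manually label each sentence with what it describes, so I would like to leave that task to the word2vec or doc2vec model. So my question is: what algorithm can I use in Python for the classifier? (I was thinking of applying some clustering on top of word2vec or doc2vec and manually labelling each cluster, but that would take some time and is not the best solution.) Previously I used "LinearSVC" (an SVM) with OneVsRestClassifier, but there I labelled every sentence by hand (building the vector "y_train" manually) in order to predict which class a new test sentence belongs to. What would be a good algorithm and approach in Python for this type of classifier, one that uses semantic word representations to obtain the training data?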

2 answers:

Answer 0 (score: 2)

The issue with things like word2vec/doc2vec and so on, and really with any unsupervised method, is that it only uses context. So, for example, if I have a sentence like "Today is a hot day" and another like "Today is a cold day", the model sees "hot" and "cold" in identical contexts, treats them as very similar, and tends to put them in the same cluster.
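To make the hot/cold point concrete, here is a minimal sketch. It is not word2vec itself: it uses plain co-occurrence counts as a stand-in for a context-based embedding, and the helper names (`context_vectors`, `cosine`) and the window size are my own choices for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentences = ["Today is a hot day", "Today is a cold day"]
vectors = context_vectors(sentences)

hot_cold = cosine(vectors["hot"], vectors["cold"])
hot_today = cosine(vectors["hot"], vectors["today"])
print(f"hot vs cold:  {hot_cold:.2f}")
print(f"hot vs today: {hot_today:.2f}")
```

In this toy corpus "hot" and "cold" have exactly the same neighbours ("is", "a", "day"), so their context vectors are identical and nothing in the representation can tell the two opposites apart.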

This makes it pretty bad for tagging. Either way, there is a good implementation of Doc2Vec and Word2Vec in the gensim module for Python. You can quickly load the prebuilt binary trained on the Google News dataset and test whether you get meaningful clusters.
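A rough sketch of that check, under some assumptions: the clustering algorithm (k-means from scikit-learn) is my pick, not something prescribed above, the file name passed to the loader is wherever you saved the Google News binary, and the toy vectors are made up so the snippet runs without the large download.

```python
import numpy as np
from sklearn.cluster import KMeans


def load_google_news(path):
    """Load pretrained word2vec vectors with gensim (needs the binary on disk)."""
    from gensim.models import KeyedVectors  # imported lazily; only needed here
    return KeyedVectors.load_word2vec_format(path, binary=True)


def cluster_words(vectors, words, n_clusters):
    """Run k-means on the vectors of `words`; `vectors[word]` must give a 1-D array."""
    matrix = np.vstack([vectors[w] for w in words])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(matrix)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return clusters


# Made-up 3-d vectors standing in for real embeddings.
toy_vectors = {
    "hot": np.array([1.0, 0.9, 0.0]),
    "cold": np.array([0.9, 1.0, 0.1]),
    "python": np.array([0.0, 0.1, 1.0]),
    "java": np.array([0.1, 0.0, 0.9]),
}
clusters = cluster_words(toy_vectors, list(toy_vectors), n_clusters=2)
print(clusters)

# With the real vectors (the path is an assumption about your local setup):
# kv = load_google_news("GoogleNews-vectors-negative300.bin")
# print(cluster_words(kv, ["hot", "cold", "python", "java"], n_clusters=2))
```

Eyeballing which words land together on a sample of your own forum vocabulary is a cheap way to see whether the clusters line up with your categories before investing in labelling them.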

The other thing you could try is to set up a simple Lucene/Solr system on your computer and begin tagging a few sentences at random. Over time Lucene/Solr will suggest tags for your documents, and they do come out as pretty decent tags if your data is not really bad.

The issue here is that the problem you're trying to solve isn't particularly easy, nor is it completely solvable. If you have very good, clear data, you may be able to auto-classify about 80-90% of it, but if the data is bad, you won't be able to auto-classify much.

Answer 1 (score: 0)

For a multiclass classification problem over sentences, doc2vec works fine, because the context rarely changes much within a single sentence.

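If you want to stick to Python only, I would suggest doc2vec (to build the features) followed by xgboost (to train the classifier). This worked for me on a similar problem.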