Question

I have a set of trainFeatures and a set of testFeatures, each labeled positive, neutral, or negative:

trainFeats = negFeats + posFeats + neutralFeats
testFeats  = negFeats + posFeats + neutralFeats

For example, one entry in trainFeats is

(['blue', 'yellow', 'green'], 'POSITIVE')

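The test feature list has the same form, so I have specified a label for every entry in each set. My question is how I can use the scikit implementations of a random forest classifier and an SVM to get the classifier's accuracy, along with precision and recall scores for each class. The problem is that I am currently using words as features, whereas from what I have read these classifiers need numbers. Is there a way to achieve this without changing my features? Many thanks!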

Answer 1

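You can look at this scikit-learn tutorial, especially the section on learning and predicting, to see how to create and use a classifier. The example uses an SVM, but switching to RandomForestClassifier is straightforward, since all classifiers implement the fit and predict methods.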
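To illustrate the shared fit/predict interface, here is a minimal sketch. The toy numeric data, the linear kernel, and the n_estimators and random_state values are assumptions chosen for illustration only; they are not taken from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical toy data: two numeric features per sample, two well-separated classes.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y_train = ["NEGATIVE"] * 4 + ["POSITIVE"] * 4
X_test = [[0.5, 0.5], [5.5, 5.5]]

# Both estimators are trained and queried through the same two methods.
classifiers = (
    SVC(kernel="linear"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

predictions = {}
for clf in classifiers:
    clf.fit(X_train, y_train)
    predictions[type(clf).__name__] = list(clf.predict(X_test))

for name, predicted in predictions.items():
    print(name, predicted)
```

Because the loop body never depends on the estimator type, swapping in another scikit-learn classifier should only require changing the tuple.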

When working with text features, you can use CountVectorizer or DictVectorizer. Take a look at the feature extraction documentation, especially section 4.1.3.
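As a rough sketch of the DictVectorizer route, the word lists from the question can stay as they are and be mapped to a numeric matrix just before training. The sample entries other than the one quoted in the question, and the to_dict helper, are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Entries in the same (word list, label) form as in the question.
# Only the first training entry comes from the question; the rest are made up.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["blue", "grey"], "NEUTRAL"),
]
test_feats = [(["green", "blue", "purple"], "POSITIVE")]


def to_dict(words):
    """Hypothetical helper: mark each word as present (repeat counts are dropped)."""
    return {word: 1 for word in words}


vectorizer = DictVectorizer()

# Learn the vocabulary on the training set only, then reuse it for the test set.
X_train = vectorizer.fit_transform([to_dict(words) for words, _ in train_feats])
y_train = [label for _, label in train_feats]

X_test = vectorizer.transform([to_dict(words) for words, _ in test_feats])
y_test = [label for _, label in test_feats]

print(X_train.shape)     # one row per entry, one column per distinct training word
print(X_test.toarray())  # words unseen during training ("purple") are ignored
```

X_train and X_test are sparse matrices that can be passed directly to fit and predict. If word counts matter, collections.Counter(words) could replace the helper.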

You can find an example of classifying text documents here.

After that, you can use the classification report to get the classifier's precision and recall.
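A small sketch of the scoring step, assuming the true and predicted labels are already available as lists. The label lists here are made up; in practice y_pred would come from clf.predict on the vectorized test set. Accuracy is not part of the per-class figures, so accuracy_score is used for it separately.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Hypothetical labels: y_true from the test set, y_pred from a classifier.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"]
labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

# Overall fraction of correctly predicted samples.
accuracy = accuracy_score(y_true, y_pred)

# Per-class precision and recall as arrays, ordered like `labels`.
precision, recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
per_class = {
    label: (float(p), float(r)) for label, p, r in zip(labels, precision, recall)
}

print("accuracy:", accuracy)
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

classification_report returns a formatted text table for reading, while precision_recall_fscore_support gives the same numbers as arrays if they are needed in code.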