K-fold cross-validation on a Python pandas DataFrame - NLTK classification

Time: 2018-01-09 00:02:56

Tags: python pandas nlp nltk cross-validation

I want to evaluate an NLTK classification model with 10-fold cross-validation. My data is in a pandas DataFrame named data (10k rows, 10 classes).

  

Features: hello_variant, goodbye_variant, wh_question, yesNo_question, conjuction_start, No_of_tokens


I tried the code below, but it raises an error:

extract_features = data.drop(['class'],axis=1)
documents = data['class']

import nltk
from sklearn import cross_validation
training_set = nltk.classify.apply_features(extract_features, documents)
cv = cross_validation.KFold(len(training_set), n_folds=10,  shuffle=False, random_state=None)

for traincv, testcv in cv:
    classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]])
    print 'accuracy:', nltk.classify.util.accuracy(classifier, training_set[testcv[0]:testcv[len(testcv)-1]])

Error:

> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-253-2ddaf7264527> in <module>()
>       1 import nltk
>       2 from sklearn import cross_validation
> ----> 3 training_set = nltk.classify.apply_features(extract_features, documents)
>       4 cv = cross_validation.KFold(len(training_set), n_folds=10,  shuffle=False, random_state=None)
>       5 
> 
> C:\Users\SampathR\Anaconda2\envs\dato-env\lib\site-packages\nltk\classify\util.pyc in apply_features(feature_func, toks, labeled)
>      60     """
>      61     if labeled is None:
> ---> 62         labeled = toks and isinstance(toks[0], (tuple, list))
>      63     if labeled:
>      64         def lazy_func(labeled_token):
> 
> C:\Users\SampathR\Anaconda2\envs\dato-env\lib\site-packages\pandas\core\generic.pyc in __nonzero__(self)
>     712         raise ValueError("The truth value of a {0} is ambiguous. "
>     713                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
> --> 714                          .format(self.__class__.__name__))
>     715 
>     716     __bool__ = __nonzero__
> 
> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
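
Judging from the traceback, `apply_features` evaluates `toks and isinstance(toks[0], ...)`, so passing a pandas Series as `toks` triggers the ambiguous truth-value error. It also expects a feature function as its first argument, not a DataFrame. The slices `training_set[traincv[0]:traincv[len(traincv)-1]]` would in addition treat the fold indices as one contiguous range and drop the last item. One possible way around all three, as a minimal sketch: convert each row to a plain `(feature_dict, label)` pair and split the folds by hand instead of using sklearn's `KFold`. The helper names `rows_to_featuresets`, `kfold_indices`, and `cross_validate` are hypothetical, and the label column is assumed to be called `class` as in the question.

```python
def rows_to_featuresets(rows, label_key="class"):
    """Turn row dicts into NLTK-style (feature_dict, label) pairs.

    `rows` is assumed to be a list of dicts, e.g. data.to_dict("records").
    """
    return [
        ({key: value for key, value in row.items() if key != label_key},
         row[label_key])
        for row in rows
    ]


def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n))
        yield train_idx, test_idx
        start = stop


def cross_validate(featuresets, k=10):
    """Train a Naive Bayes classifier per fold and return the accuracies."""
    import nltk  # imported here so the helpers above work without NLTK

    accuracies = []
    for train_idx, test_idx in kfold_indices(len(featuresets), k):
        train = [featuresets[i] for i in train_idx]
        test = [featuresets[i] for i in test_idx]
        classifier = nltk.NaiveBayesClassifier.train(train)
        accuracies.append(nltk.classify.util.accuracy(classifier, test))
    return accuracies
```

With the DataFrame from the question, this might be called as `cross_validate(rows_to_featuresets(data.to_dict("records")))`. Since the folds are not shuffled, rows sorted by class would give misleading scores, so shuffling the rows first may be worth considering.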

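In addition, I would like to get the precision, recall, and F-measure for each dialogue act (class) in the corpus, as well as the classifier's accuracy and confusion matrix. Does NLTK provide a way to compute these (other than using sklearn)?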
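
For the per-class scores, NLTK's `nltk.metrics` module offers `precision`, `recall`, and `f_measure`, which compare sets of item indices, and `ConfusionMatrix`, which compares two label lists. A rough sketch of how they might be combined follows. The helper names `index_sets_by_label` and `per_class_report` are hypothetical, and `test_set` is assumed to be a list of `(feature_dict, label)` pairs.

```python
from collections import defaultdict


def index_sets_by_label(gold_labels, predicted_labels):
    """Group item indices by gold label and by predicted label."""
    refsets, testsets = defaultdict(set), defaultdict(set)
    for i, (gold, pred) in enumerate(zip(gold_labels, predicted_labels)):
        refsets[gold].add(i)
        testsets[pred].add(i)
    return refsets, testsets


def per_class_report(classifier, test_set):
    """Return ({label: (precision, recall, f_measure)}, confusion_matrix)."""
    # Imported here so index_sets_by_label works without NLTK installed.
    from nltk.metrics import ConfusionMatrix, f_measure, precision, recall

    gold = [label for _, label in test_set]
    predicted = [classifier.classify(features) for features, _ in test_set]
    refsets, testsets = index_sets_by_label(gold, predicted)

    scores = {}
    for label in sorted(refsets):
        # A score is None when it is undefined, e.g. precision for a
        # label that the classifier never predicted.
        scores[label] = (
            precision(refsets[label], testsets[label]),
            recall(refsets[label], testsets[label]),
            f_measure(refsets[label], testsets[label]),
        )
    return scores, ConfusionMatrix(gold, predicted)
```

Printing the returned `ConfusionMatrix` object shows the matrix as a text table. Overall accuracy is already available from `nltk.classify.util.accuracy`, as in the code above.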

0 answers:

No answers yet