Error encountered while learning with partial_fit in scikit-learn

Posted: 2015-09-21 13:54:33

Tags: python machine-learning scikit-learn

When I train with the partial_fit function in scikit-learn, I get the warning below without the program terminating, and the trained model still runs correctly and produces correct output. How is that possible? It is also repeatable. Is this something to worry about?

/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py:207: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = (np.log(self.class_count_)

I am using the modified train function below, because I have to maintain a constant list of labels/classes: partial_fit does not allow adding new classes/labels on subsequent runs. The class priors are the same in every batch of training data:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets,classes=None, partial=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)

        if partial:
            classes = self._encoder.fit_transform(list(set(classes)))
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            self._clf.fit(X, y)

        return self

Also, on the second call to partial_fit it raises the error below, with class count = 2000 and 3592 training samples, when calling model = self.train(featureset, classes=labels, partial=partial):

self._clf.partial_fit(X, y, classes=list(set(classes)))
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 277, in partial_fit
    self._count(X, Y)
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 443, in _count
    self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430) 

Going by the error thrown, where am I going wrong? Does it mean I am pushing in data of incorrect dimensions? After trying things out, I now call:

        X = self._vectorizer.transform(X)
        y = self._encoder.transform(y)

on every partial_fit call. Previously I used fit_transform for every partial_fit call. Is this correct?

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self._clf

After many attempts, I got the code below working by treating the first call specially. But I assumed the pickled classifier file would grow after each iteration, yet I get the same-sized .pkl file for every batch, which should not be possible:

 class MySklearnClassifier(SklearnClassifier):

    def train(self, labeled_featuresets, classes=None, partial=False, firstcall=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            if firstcall:
                classes = self._encoder.fit_transform(np.unique(classes))
                X = self._vectorizer.fit_transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y, classes=classes)
            else:
                X = self._vectorizer.transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y)
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self

Here is the full code:

class postagger(ClassifierBasedTagger):
    """
    A classifier based postagger.
    """
    #MySklearnClassifier()
    def __init__(self, feature_detector=None, train=None, estimator=None,
                 classifierinstance=None, backoff=None,
                 cutoff_prob=None, verbose=True):

        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers
        if estimator:
            classifier = MySklearnClassifier(estimator=estimator)
            #MySklearnClassifier.__init__(self, classifier)
        elif classifierinstance:
            classifier = classifierinstance

        if feature_detector is not None:
            self._feature_detector = feature_detector
            # The feature detector function, used to generate a featureset
            # for each token: feature_detector(tokens, index, history) -> featureset

        self._cutoff_prob = cutoff_prob
        """Cutoff probability for tagging -- if the probability of the
           most likely tag is less than this, then use backoff."""

        self._classifier = classifier
        """The classifier used to choose a tag for each token."""

        # if train and picklename:
        #     self._train(classifier_builder, picklename,tagged_corpus=train, ONLYERRORS=False,verbose=True,onlyfeatures=True ,LOADCONSTRUCTED=None)

    def legacy_getfeatures(self, tagged_corpus=None, ONLYERRORS=False, existingfeaturesetfile=None, verbose=True,
                           labels=artlabels):

        featureset = []
        labels=artlabels
        if not existingfeaturesetfile and tagged_corpus:
            if ONLYERRORS:

                classifier_corpus = open(tagged_corpus + '-ONLYERRORS.richfeature', 'w')
            else:
                classifier_corpus = open(tagged_corpus + '.richfeature', 'w')

            if verbose:
                print('Constructing featureset  for training corpus for classifier.')
            nlp = English()
            #df=pandas.DataFrame()
            store = HDFStore('featurestore.h5')

            for sentence in sPickle.s_load(open(tagged_corpus,'r')):
                untagged_words, tags, senindex = zip(*sentence)
                doc = nlp(u' '.join(untagged_words))
                # untagged_sentence, tags , rest = unpack_three(*zip(*sentence))
                for index in range(len(sentence)):
                    if ONLYERRORS:
                        if tags[index] == '<!SAME!>' and random.random() < 0.05:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)
                            # classifier_corpus.append((featureset, tags[index]))
                        elif tags[index] in labels:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)


                        # classifier_corpus.append((featureset, tags[index]))
        # else:
        #     for element in sPickle.s_load(open(existingfeaturesetfile, 'w')):
        #         featureset.append( element)

        return tagged_corpus + '.richfeature'

    def _train(self, featuresetdata, classifier_builder=MultinomialNB(), partial=False, batchsize=500):
        """
        Build a new classifier, based on the given training data
        *tagged_corpus*.

        """



        #labels = set(cPickle.load(open(arguments['-k'], 'r')))
        if partial==False:
           print('Training classifier FULLMODE')
           featureset = []
           for element in sPickle.s_load(open(featuresetdata, 'r')):
               featureset.append(element)

           model = self._classifier.train(featureset, classes=artlabels, partial=False,firstcall=True)
           print('Training complete, dumping')
           try:
            joblib.dump(model,  str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.mpkl')
            print "joblib dumped"
           except:
               print "joblib error"
           cPickle.dump(model, open(str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.cmpkl', 'w'))
           print('dumped')
           return
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

        print('Training classifier each batch of {} training points'.format(batchsize))

        for i, batchelement in enumerate(batch(sPickle.s_load(open(featuresetdata, 'r')), batchsize)):
            featureset = []
            for element in batchelement:
                featureset.append(element)

            # model = super(postagger, self).train(featureset, partial)
            # pdb.set_trace()
            # featureset = [item for sublist in featureset for item in sublist]
            trainsize = len(featureset)
            print("submitting {} training points for training\neg last one:".format(trainsize))
            for d, l in featureset:
                if len(d) != 113:
                    print d
                    assert False

            print featureset[-1]
            # pdb.set_trace()
            try:
                if i==0:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=True)
                else:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=False)

            except:
                type, value, tb = sys.exc_info()
                traceback.print_exc()
                pdb.post_mortem(tb)

            print('Training for batch {} complete, dumping'.format(i))
            cPickle.dump(model, open(
                str(featuresetdata) + '-' + slugify(str(classifier_builder))[
                                            :10] + 'UPDATED batch-{} of {} points.mpkl'.format(
                    i, trainsize), 'w'))
            print('dumped')
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

    def untag(self,tagged_sentence):
        """
        Given a tagged sentence, return an untagged version of that
        sentence.  I.e., return a list containing the first element
        of each tuple in *tagged_sentence*.

            >>> from nltk.tag.util import untag
            >>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
            ['John', 'saw', 'Mary']

        """

        return [w[0] for w in tagged_sentence]

    def evaluate(self, gold):
        """
        Score the accuracy of the tagger against the gold standard.
        Strip the tags from the gold standard text, retag it using
        the tagger, then compute the accuracy score.

        :type gold: list(list(tuple(str, str)))
        :param gold: The list of tagged sentences to score the tagger on.
        :rtype: float
        """
        gold_tokens=[]
        full_gold_tokens=[]

        tagged_sents = self.tag_sents(self.untag(sent) for sent in gold)
        for sentence in gold:  # flatten the list
            untagged_sentences, goldtags, type_feature, startpos_feature, sentence_feature, senindex_feature = zip(*sentence)
            gold_tokens.extend(zip(untagged_sentences, goldtags))
            full_gold_tokens.extend(zip(untagged_sentences, goldtags, type_feature, startpos_feature, sentence_feature, senindex_feature))

        test_tokens = sum(tagged_sents, [])  # flatten the list
        getmismatch(gold_tokens,test_tokens,full_gold_tokens)
        return accuracy(gold_tokens, test_tokens)

    #
    def new_feature_detector(self, tokens, index):
        return getfeatures(tokens, index)


    def tag_sents(self, sentences):
        """
        Apply ``self.tag()`` to each element of *sentences*.  I.e.:

            return [self.tag(sent) for sent in sentences]
        """
        return [self.tag(sent) for sent in sentences]

    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i))
        return list(zip(tokens, tags))

    def tag_one(self, tokens, index):
        """
        Determine an appropriate tag for the specified token, and
        return that tag.  If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.

        :rtype: str
        :type tokens: list
        :param tokens: The list of words that are being tagged.
        :type index: int
        :param index: The index of the word whose tag should be
            returned.
        :type history: list(str)
        :param history: A list of the tags for all words before *index*.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index)
            if tag is not None:  break
        return tag

    def choose_tag(self, tokens, index):
        # Use our feature detector to get the featureset.
        featureset = self.new_feature_detector(tokens, index)

        # Use the classifier to pick a tag.  If a cutoff probability
        # was specified, then check that the tag's probability is
        # higher than that cutoff first; otherwise, return None.

        if self._cutoff_prob is None:
            return self._classifier.prob_classify_many([featureset])
            #return self._classifier.classify_many([featureset])


        pdist = self._classifier.prob_classify_many([featureset])
        tag = pdist.max()
        return tag if pdist.prob(tag) >= self._cutoff_prob else None

2 Answers:

Answer 0 (score: 3)

1. RuntimeWarning

You are getting this warning because np.log is being called on 0:

In [6]: np.log(0)
/home/anaconda/envs/python34/lib/python3.4/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
  if __name__ == '__main__':
Out[6]: -inf

That is because in one of your calls some classes are not represented at all (their count is 0), so np.log gets called on 0. You do not need to worry about it.
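As a small illustration (with made-up data, not the poster's), declaring a class that is absent from the current batch reproduces the warning:

# A class declared in `classes` but absent from y gets count 0, so
# np.log(0) fires the RuntimeWarning and its log-prior becomes -inf.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
X = np.ones((2, 3))
clf.partial_fit(X, [0, 1], classes=[0, 1, 2])  # class 2 never appears
print clf.class_count_       # [ 1.  1.  0.]
print clf.class_log_prior_   # last entry is -inf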

2. Class priors

  "I am using the modified train function below, because I have to maintain a constant list of labels/classes: partial_fit does not allow adding new classes/labels on subsequent runs. The class priors are the same in every batch of training data."

  • If you use partial_fit, you need to pass the full list of labels/classes from the very beginning.
  • I am not sure what "the class priors are the same in each batch of training data" means; it could mean several different things, so it would be great if you could clarify. In the meantime, the default behavior of classifiers such as MultinomialNB is to fit the priors to the data (essentially, they compute class frequencies). With partial_fit they perform this computation incrementally, so that you get the same result as with a single fit call, as the sketch below illustrates.
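A minimal sketch of that incremental behavior, using random stand-in data (not the poster's):

# Class priors accumulated over two partial_fit calls match the priors
# from a single fit on the full dataset. (The first call also shows the
# warning from point 1, since class 2 is absent from that batch.)
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(5, size=(6, 10))
y = np.array([0, 0, 1, 1, 2, 2])

inc = MultinomialNB()
inc.partial_fit(X[:3], y[:3], classes=[0, 1, 2])  # full class list up front
inc.partial_fit(X[3:], y[3:])                     # counts keep accumulating

full = MultinomialNB().fit(X, y)
print np.allclose(inc.class_log_prior_, full.class_log_prior_)  # True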

3. Your error

  "Also, on the second call to partial_fit it raises the error below, with class count = 2000 and 3592 training samples, when calling model = self.train(featureset, classes=labels, partial=partial)"

We would need more details here. I am confused about what shape X has, but from the traceback it appears to be of shape (n_samples, n_features), which would mean (2000, 11430) is 2000 samples with 11430 features.

The error does mean that the dimensions of your inputs are inconsistent. I would suggest printing X.shape and y.shape after each vectorization call on X.

Also, you should not call fit_transform on the vectorizer for every partial_fit call: you should fit it once, and then only transform X. That is what ensures you get consistent dimensions for the transformed X.
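A minimal sketch of that fit-once pattern, assuming DictVectorizer and LabelEncoder as stand-ins for the poster's self._vectorizer and self._encoder (the actual objects may differ):

# Fit the vectorizer and encoder once, then only transform later batches,
# so every partial_fit call sees the same number of features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB

batches = [  # hypothetical featuresets and labels
    ([{'w': 1, 'x': 2}, {'w': 3}], ['A', 'B']),
    ([{'x': 1}, {'w': 2, 'x': 1}], ['B', 'A']),
]

vec, enc, clf = DictVectorizer(), LabelEncoder(), MultinomialNB()
vec.fit(batches[0][0])                   # fit ONCE, ideally on data covering the vocabulary
classes = enc.fit_transform(['A', 'B'])  # fixed label list, known up front

for i, (feats, labels) in enumerate(batches):
    X = vec.transform(feats)             # transform only -- never refit
    y = enc.transform(labels)
    print X.shape, y.shape               # consistent feature count on every call
    if i == 0:
        clf.partial_fit(X, y, classes=classes)
    else:
        clf.partial_fit(X, y)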

4. Your earlier 'solution'

Here is the code you told us you are using:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self._clf

As far as I can tell, there is nothing very wrong with it, but we really need more context on how you are using it.
A nitpick: I feel it would be clearer if you made the classes variable a class attribute, since it has to be identical for every partial_fit call.
Things could be going wrong here if you were passing different values for the classes argument across calls.

More information that would help us:

  • Printouts of X.shape and y.shape.
  • Context: how are you using the code you showed us?
  • What are you using for self._vectorizer? Which classifier do you end up working with?

Answer 1 (score: 1)

I believe the problem here can be seen in the error:

ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430)

Note the shapes (2000, 11430) (2000, 10728) (2000, 11430): the datasets you are providing have different numbers of features. The operation cannot go through because the feature count differs in the second batch, which is what produces the error. It only prints the error and does not crash, most likely because the exception is caught in a try/except block.

You probably do not want to feed the partial_fit function feature (attribute) sets of different sizes. Your algorithm can still appear to perform, since it fits on the other two chunks, but the second chunk is most likely being ignored.
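For what it is worth, a minimal sketch (made-up two-batch data, not the poster's) that reproduces this failure mode by refitting a vectorizer per batch:

# Refitting the vectorizer on each batch changes the feature count, so the
# second partial_fit cannot be broadcast against the accumulated
# feature_count_ (the exact message depends on the scikit-learn version).
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

batch1 = [{'a': 1, 'b': 1}, {'a': 2}]          # vocabulary {a, b}: 2 features
batch2 = [{'a': 1, 'c': 1, 'd': 1}, {'c': 2}]  # vocabulary {a, c, d}: 3 features

clf = MultinomialNB()
clf.partial_fit(DictVectorizer().fit_transform(batch1), [0, 1], classes=[0, 1])

# A fresh vectorizer yields 3 features instead of 2, raising e.g.
# ValueError: operands could not be broadcast together with shapes ...
clf.partial_fit(DictVectorizer().fit_transform(batch2), [0, 1])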