NLTK: How to define the "labeled_featuresets" when creating a ClassifierBasedTagger with nltk?

Asked: 2019-03-19 14:52:36

Tags: python classification nltk named-entity-recognition nltk-trainer

I am playing around with NLTK right now. I am trying to create various classifiers with NLTK for named entity recognition, in order to compare their results. Creating n-gram taggers was easy; however, I have run into some issues creating a ClassifierBasedTagger for the Naive Bayes or Decision Tree classifier.

My data is in the CoNLL IOB format. After reading it in, I convert each token into a tuple that looks like this: ((word, POS-tag), entity)
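For example, one training sentence ends up looking like this (made-up data, just to show the structure):

# one sentence after reading in the CoNLL IOB data (invented example)
sentence = [
    (('Angela', 'NNP'), 'B-PER'),
    (('Merkel', 'NNP'), 'I-PER'),
    (('visited', 'VBD'), 'O'),
    (('Paris', 'NNP'), 'B-LOC'),
]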

I have created the following class that builds the classifiers:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.tag.sequential import UnigramTagger, BigramTagger, TrigramTagger


class ClassifierChunker(ChunkParserI):

    def __init__(self, trainSents, tagger, **kwargs):
        # n-gram taggers have no feature detector, so only keep a reference
        # to it for classifier-based taggers
        if not isinstance(tagger, (UnigramTagger, BigramTagger, TrigramTagger)):
            self.featureDetector = tagger.feature_detector
        self.tagger = tagger

    def parse(self, sentence):
        # tag the sentence and turn the ((word, pos), entity) output
        # back into a chunk tree
        chunks = self.tagger.tag(sentence)
        iobTriplets = [(word, pos, entity) for ((word, pos), entity) in chunks]
        return conlltags2tree(iobTriplets)

    def evaluate2(self, testSents):
        return self.evaluate(
            [conlltags2tree([(word, pos, entity) for (word, pos), entity in iobs])
             for iobs in testSents])

This is how I call it:

# naiveBayers
naiveBayers = NaiveBayesClassifier.train
naiveBayersTagger = ClassifierBasedTagger(train=completeTaggedSentencesTrain,
                                          feature_detector=features,
                                          classifier_builder=naiveBayers)
nerChunkerNaiveBayers = ClassifierChunker(completeTaggedSentencesTrain, naiveBayersTagger)
evalNaiveBayers = nerChunkerNaiveBayers.evaluate2(completeTaggedSentencesTest)
print(evalNaiveBayers)

The problem I have is with the first line of code (naiveBayers = NaiveBayesClassifier.train). I know I am supposed to pass the train function labeled featuresets, but I am not exactly sure what that means. The documentation says the following:

:param labeled_featuresets: A list of classified featuresets, i.e., a list of tuples (featureset, label).

Would the featureset be the word and the label the entity?
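If I understand the docs correctly, a featureset is not just the word but a dictionary mapping feature names to values, and the label would be the IOB entity tag. So if I were to call train directly, I think it would have to look roughly like this (made-up feature values, just my understanding):

# my understanding, not verified: a list of (feature dict, label) pairs
labeled_featuresets = [
    ({'word': 'Berlin', 'pos': 'NNP', 'prevpos': 'IN'}, 'B-LOC'),
    ({'word': 'visited', 'pos': 'VBD', 'prevpos': 'PRP'}, 'O'),
]
classifier = NaiveBayesClassifier.train(labeled_featuresets)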

After encountering this problem I did some research and found nltk-trainer. There, the classifier_builder is created inside the args.py file, more specifically in the inner function "trainf" of the function "make_classifier_builder". However, I have no idea where the variable "train_feats" comes from; maybe it has something to do with my limited understanding of inner functions. I can't find it being called anywhere.
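My rough guess (a simplified sketch, not the actual args.py code) is that make_classifier_builder only returns a function, and train_feats is supplied much later, when ClassifierBasedTagger calls the builder with the labeled featuresets it has built itself from the training sentences:

def make_classifier_builder(train_args):   # train_args is a hypothetical placeholder
    def trainf(train_feats):
        # train_feats only exists once somebody calls the returned function
        return NaiveBayesClassifier.train(train_feats)
    return trainf

builder = make_classifier_builder(None)    # no training data involved yet
# ClassifierBasedTagger later runs the feature detector over the training
# sentences and then calls something like:
# classifier = builder(labeled_featuresets)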

I would really appreciate it if someone could point me in the right direction.

Edit: I have just read in the NLTK 3 Cookbook that the feature_detector function returns a featureset (p. 143). So am I supposed to use that function in some way?

My current feature detector looks as follows and is taken from that book:

def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]

    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]

    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]

    feats = {
        'word': word,
        'pos': pos,
        'nextword': nextword,
        'nextpos': nextpos,
        'prevword': prevword,
        'prevpos': prevpos,
        'previob': previob
    }
    return feats
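If my guess above is right, I never have to call NaiveBayesClassifier.train myself, because the ClassifierBasedTagger combines the output of this detector with the IOB labels on its own. Just to check that I understand the data flow, this is roughly how I picture the (featureset, label) pairs being built from my ((word, pos), entity) sentences (my own sketch, not library code):

labeled_featuresets = []
for sent in completeTaggedSentencesTrain:
    tokens = [(word, pos) for (word, pos), entity in sent]
    labels = [entity for (word, pos), entity in sent]
    history = []
    for index in range(len(tokens)):
        featureset = prev_next_pos_iob(tokens, index, history)
        labeled_featuresets.append((featureset, labels[index]))
        history.append(labels[index])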
