Problem loading custom text data in scikit-learn

Date: 2017-05-11 13:59:21

Tags: python numpy scikit-learn

I'm having trouble loading custom data into scikit-learn so that I can use a classifier to find named entities with Python. I should say up front that I'm new to scikit-learn; if I understand correctly, I need a numpy array as input.

So, here is my problem: I have training data in CoNLL tag format:

Where   WRB O
the DT  O
disposer    NN  B-Per
is  VBZ O
a   DT  O
non-legal   JJ  B-Per
entity  NN  I-Per
,   ,   O
the DT  O
identifier  NN  B-Per
specified   VBN O
in  IN  O
Article NNP B-Law
7   CD  O
shall   MD  O
be  VB  O
used    VBN O
.   .   O
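
(For context, a minimal sketch of reading a file in this format into per-sentence (word, pos, iob) triples; it assumes whitespace-separated columns with a blank line between sentences, and read_conll is a hypothetical helper, not part of the code below.)

def read_conll(path):
    """Yield one sentence at a time as a list of (word, pos, iob) triples."""
    sentence = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line marks a sentence boundary
                if sentence:
                    yield sentence
                sentence = []
                continue
            word, pos, iob = line.split()
            sentence.append((word, pos, iob))
    if sentence:  # final sentence if the file has no trailing blank line
        yield sentence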

I have now built my classifier with the help of the following example: 1) I have a feature-detector function:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
def ner_features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """

    tokens = [('[START2]', '[START2]'), ('[START1]', '[START1]')] + list(tokens) + [('[END1]', '[END1]'), ('[END2]', '[END2]')]
    history = ['[START2]', '[START1]'] + list(history)

    # shift the index with 2, to accommodate the padding
    index += 2

    # unpack the current token and its neighbours; `shape` is a helper
    # (defined elsewhere) that maps a word to an orthographic class such
    # as 'lowercase' or 'capitalized'
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[index - 1]
    prevpreviob = history[index - 2]

    feat_dict = {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'shape': shape(word),

        'next-word': nextword,
        'next-pos': nextpos,
        'next-lemma': stemmer.stem(nextword),
        'next-shape': shape(nextword),

        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,
        'next-next-lemma': stemmer.stem(nextnextword),
        'next-next-shape': shape(nextnextword),

        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-lemma': stemmer.stem(prevword),
        'prev-iob': previob,
        'prev-shape': shape(prevword),

        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,
        'prev-prev-lemma': stemmer.stem(prevprevword),
        'prev-prev-iob': prevpreviob,
        'prev-prev-shape': shape(prevprevword),
    }

    return feat_dict
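
These feature dicts are not themselves the numpy array mentioned above; DictVectorizer (used in the training code below) does that conversion, one-hot encoding every feature=value pair into its own column. A toy illustration with a reduced feature set:

from sklearn.feature_extraction import DictVectorizer

feats = [
    {'word': 'Where', 'pos': 'WRB', 'prev-word': '[START1]'},
    {'word': 'the', 'pos': 'DT', 'prev-word': 'Where'},
]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feats)  # numpy array with one column per feature=value pair
print(X.shape)          # (2, 6): six distinct feature=value columns
print(vec.vocabulary_)  # maps e.g. 'word=Where' to its column index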

2) I have a Perceptron classifier from sklearn:

import itertools

from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline


class ScikitLearnChunker(ChunkParserI):

    @classmethod
    def to_dataset(cls, parsed_sentences, feature_detector):
        """
        Transform a list of tagged sentences into a scikit-learn compatible POS dataset
        :param parsed_sentences:
        :param feature_detector:
        :return:
        """
        X, y = [], []
        for parsed in parsed_sentences:
            iob_tagged = tree2conlltags(parsed)
            words, tags, iob_tags = zip(*iob_tagged)

            tagged = zip(words, tags)

            for index in range(len(iob_tagged)):
                X.append(feature_detector(tagged, index, history=iob_tags[:index]))
                y.append(iob_tags[index])

        return X, y

    @classmethod
    def get_minibatch(cls, parsed_sentences, feature_detector, batch_size=500):
        batch = list(itertools.islice(parsed_sentences, batch_size))
        X, y = cls.to_dataset(batch, feature_detector)
        return X, y

    @classmethod
    def train(cls, parsed_sentences, feature_detector, all_classes, **kwargs):
        X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))
        vectorizer = DictVectorizer(sparse=False)
        vectorizer.fit(X)

        clf = Perceptron(verbose=10, n_jobs=-1, n_iter=kwargs.get('n_iter', 5))

        while len(X):
            X = vectorizer.transform(X)
            clf.partial_fit(X, y, all_classes)
            X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))

        clf = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', clf)
        ])

        return cls(clf, feature_detector)

    def __init__(self, classifier, feature_detector):
        self._classifier = classifier
        self._feature_detector = feature_detector

    def parse(self, tokens):
        """
        Chunk a tagged sentence
        :param tokens: List of words [(w1, t1), (w2, t2), ...]
        :return: chunked sentence: nltk.Tree
        """
        history = []
        iob_tagged_tokens = []
        for index, (word, tag) in enumerate(tokens):
            iob_tag = self._classifier.predict([self._feature_detector(tokens, index, history)])[0]
            history.append(iob_tag)
            iob_tagged_tokens.append((word, tag, iob_tag))

        return conlltags2tree(iob_tagged_tokens)

    def score(self, parsed_sentences):
        """
        Compute the accuracy of the tagger for a list of test sentences
        :param parsed_sentences: List of parsed sentences: nltk.Tree
        :return: float 0.0 - 1.0
        """
        X_test, y_test = self.__class__.to_dataset(parsed_sentences, self._feature_detector)
        return self._classifier.score(X_test, y_test)
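
One detail worth noting before the call: Perceptron.partial_fit must see the complete label set on its first call, so the all_classes argument used below is assumed to be the full list of IOB tags in the corpus, for example:

# all IOB labels occurring in the corpus (names taken from the sample data above)
all_classes = ['O', 'B-Per', 'I-Per', 'B-Law', 'I-Law']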

When I call the classifier like this:

ScikitLearnChunker.train(itertools.islice(reader, 5000), feature_detector=ner_features, all_classes=all_classes)

I get this error:

    nextword, nextpos = tokens[index + 1]
IndexError: list index out of range

My first question is: is the error really raised inside the ner_features function? And if so, why?

I checked the data and everything looks fine, and I also checked that to_dataset and the other functions are actually processing the corpus data. I don't know where else to look; the line `nextword, nextpos = tokens[index + 1]` itself looks fine. I must be missing something trivial but important, and I just don't know where or what. It probably comes down to my lack of knowledge about loading data in sklearn.

Here, as an example, is one of the feature dicts that ner_features passes on:

[{'word': 'Thousands', 'lemma': 'thousand', 'pos': 'NNS', 'shape': 'capitalized', 'next-word': 'of', 'next-pos': 'IN', 'next-lemma': 'of', 'next-shape': 'lowercase', 'next-next-word': 'demonstrators', 'next-next-pos': 'NNS', 'next-next-lemma': 'demonstr', 'next-next-shape': 'lowercase', 'prev-word': '__START1__', 'prev-pos': '__START1__', 'prev-lemma': '__start1__', 'prev-iob': '__START1__', 'prev-shape': 'wildcard', 'prev-prev-word': '__START2__', 'prev-prev-pos': '__START2__', 'prev-prev-lemma': '__start2__', 'prev-prev-shape': 'wildcard'}]

Please help; I've been wrestling with this for days and I don't know what else to do.

UPDATE: I managed to change the formatting of my corpus so that I can read it with corpus_reader. I think the problem lies in converting the data into an array. I no longer get the IndexError: list index out of range, but now it seems my array is not being filled:

ValueError: Found array with 0 sample(s) (shape=(0, 315)) while a minimum of 1 is required.
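
One thing I can check (assuming reader is a one-shot generator): whether it was already consumed before train ran, because then the first minibatch would be empty and transform would produce exactly such a zero-sample array:

import itertools

# if `reader` was already iterated (e.g. by an earlier run or a previous
# islice), this prints 0 and train() fits on an empty first batch
batch = list(itertools.islice(reader, 5000))
print(len(batch))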

Any ideas what I'm doing wrong here? If you need more insight into the code, let me know. Thanks.

1 answer:

Answer 0 (score: 0)

Change

X.append(feature_detector(tagged, index, history=iob_tags[:index]))

to

X.append(feature_detector(list(zip(words, tags)), index, history=iob_tags[:index]))

and remove the now-obsolete line

tagged = zip(words, tags)

I ran into the same problem; I think the code was written for Python 2.7 and breaks under 3.5. In Python 2, zip returns a list that can be iterated any number of times; in Python 3 it returns a one-shot iterator, so tagged is exhausted after the first call to feature_detector, every later call sees only the padding tokens, and tokens[index + 1] goes out of range.
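
A minimal demonstration of the difference:

words = ('Where', 'the')
tags = ('WRB', 'DT')
tagged = zip(words, tags)  # Python 3: a one-shot iterator, not a list

print(list(tagged))  # [('Where', 'WRB'), ('the', 'DT')]
print(list(tagged))  # []  (already exhausted)

# inside ner_features only the padding then survives:
tokens = ([('[START2]', '[START2]'), ('[START1]', '[START1]')]
          + list(tagged)
          + [('[END1]', '[END1]'), ('[END2]', '[END2]')])
print(len(tokens))  # 4, so the neighbour lookups run off the end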