I'm having trouble loading custom data into scikit-learn in order to use a classifier to find named entities with Python. I have to say I'm new to scikit-learn, and as far as I understand I need a numpy array as input, if I'm right about that.
So, here is my problem: I have training data in CoNLL format:
Where WRB O
the DT O
disposer NN B-Per
is VBZ O
a DT O
non-legal JJ B-Per
entity NN I-Per
, , O
the DT O
identifier NN B-Per
specified VBN O
in IN O
Article NNP B-Law
7 CD O
shall MD O
be VB O
used VBN O
. . O
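As a side note, data in this three-column format (word, POS tag, IOB tag) can be read with NLTK's ConllCorpusReader, which yields the nltk.Tree objects that to_dataset below expects. A minimal sketch, assuming the data sits in a file called train.txt in the current directory (the file name and path are placeholders):

from nltk.corpus.reader.conll import ConllCorpusReader

# the three columns are word, POS tag and IOB chunk tag
reader = ConllCorpusReader('.', 'train.txt', columntypes=('words', 'pos', 'chunk'))

# chunked_sents() yields one nltk.Tree per sentence, which is what
# tree2conlltags() inside to_dataset() consumes
for tree in reader.chunked_sents():
    print(tree)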
Now I have built my classifier with the help of the following example code: 1) I have a feature-detector function:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def ner_features(tokens, index, history):
    """
    `tokens`  = a POS-tagged sentence [(w1, t1), ...]
    `index`   = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """
    # pad the sequence so the window features are defined at the sentence boundaries
    tokens = [('[START2]', '[START2]'), ('[START1]', '[START1]')] + list(tokens) + [('[END1]', '[END1]'), ('[END2]', '[END2]')]
    history = ['[START2]', '[START1]'] + list(history)

    # shift the index by 2, to accommodate the padding
    index += 2

    # unpack the 5-token window around the current position
    # (these assignments are implied by the feature dict below and by
    # the traceback later in the post)
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[index - 1]
    prevpreviob = history[index - 2]

    # shape() is a word-shape helper; see the sketch after this function
    feat_dict = {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'shape': shape(word),

        'next-word': nextword,
        'next-pos': nextpos,
        'next-lemma': stemmer.stem(nextword),
        'next-shape': shape(nextword),

        'next-next-word': nextnextword,
        'next-next-pos': nextnextpos,
        'next-next-lemma': stemmer.stem(nextnextword),
        'next-next-shape': shape(nextnextword),

        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-lemma': stemmer.stem(prevword),
        'prev-iob': previob,
        'prev-shape': shape(prevword),

        'prev-prev-word': prevprevword,
        'prev-prev-pos': prevprevpos,
        'prev-prev-lemma': stemmer.stem(prevprevword),
        'prev-prev-iob': prevpreviob,
        'prev-prev-shape': shape(prevprevword),
    }
    return feat_dict
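The shape() helper used above is not shown in the post. Judging by the values it produces in the example feature dict further down ('capitalized', 'lowercase', 'wildcard'), a minimal stand-in could look like this (a guess, not necessarily the original implementation):

import re

def shape(word):
    """Rough word-shape feature; a guessed stand-in for the missing helper."""
    if re.fullmatch(r'(__.+__|\[.+\])', word):   # padding markers like __START1__ / [START1]
        return 'wildcard'
    if re.fullmatch(r'\d+(\.\d+)?', word):
        return 'number'
    if re.fullmatch(r'\W+', word):
        return 'punct'
    if word.isupper():
        return 'uppercase'
    if word[:1].isupper() and word[1:].islower():
        return 'capitalized'
    if word.islower():
        return 'lowercase'
    return 'other'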
2) I have a chunker built around a Perceptron classifier from sklearn:
import itertools

from nltk.chunk import ChunkParserI, conlltags2tree, tree2conlltags
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline


class ScikitLearnChunker(ChunkParserI):

    @classmethod
    def to_dataset(cls, parsed_sentences, feature_detector):
        """
        Transform a list of tagged sentences into a scikit-learn compatible POS dataset
        :param parsed_sentences:
        :param feature_detector:
        :return:
        """
        X, y = [], []
        for parsed in parsed_sentences:
            iob_tagged = tree2conlltags(parsed)
            words, tags, iob_tags = zip(*iob_tagged)
            tagged = zip(words, tags)
            for index in range(len(iob_tagged)):
                X.append(feature_detector(tagged, index, history=iob_tags[:index]))
                y.append(iob_tags[index])
        return X, y

    @classmethod
    def get_minibatch(cls, parsed_sentences, feature_detector, batch_size=500):
        batch = list(itertools.islice(parsed_sentences, batch_size))
        X, y = cls.to_dataset(batch, feature_detector)
        return X, y

    @classmethod
    def train(cls, parsed_sentences, feature_detector, all_classes, **kwargs):
        X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))
        vectorizer = DictVectorizer(sparse=False)
        vectorizer.fit(X)

        # note: n_iter was renamed max_iter in newer scikit-learn releases
        clf = Perceptron(verbose=10, n_jobs=-1, n_iter=kwargs.get('n_iter', 5))

        while len(X):
            X = vectorizer.transform(X)
            clf.partial_fit(X, y, all_classes)
            X, y = cls.get_minibatch(parsed_sentences, feature_detector, kwargs.get('batch_size', 500))

        clf = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', clf)
        ])
        return cls(clf, feature_detector)

    def __init__(self, classifier, feature_detector):
        self._classifier = classifier
        self._feature_detector = feature_detector

    def parse(self, tokens):
        """
        Chunk a tagged sentence
        :param tokens: List of words [(w1, t1), (w2, t2), ...]
        :return: chunked sentence: nltk.Tree
        """
        history = []
        iob_tagged_tokens = []
        for index, (word, tag) in enumerate(tokens):
            iob_tag = self._classifier.predict([self._feature_detector(tokens, index, history)])[0]
            history.append(iob_tag)
            iob_tagged_tokens.append((word, tag, iob_tag))
        return conlltags2tree(iob_tagged_tokens)

    def score(self, parsed_sentences):
        """
        Compute the accuracy of the tagger for a list of test sentences
        :param parsed_sentences: List of parsed sentences: nltk.Tree
        :return: float 0.0 - 1.0
        """
        X_test, y_test = self.__class__.to_dataset(parsed_sentences, self._feature_detector)
        return self._classifier.score(X_test, y_test)
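One thing worth noting before the call: Perceptron.partial_fit must be told the complete label set up front, because a single minibatch may not contain every class. The post doesn't show how all_classes is built; presumably something along these lines (a sketch, with the labels guessed from the CoNLL sample above):

import numpy as np

# every IOB label the corpus can produce; a real corpus may have more
all_classes = np.array(['O', 'B-Per', 'I-Per', 'B-Law', 'I-Law'])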
When I call the classifier like this:

ScikitLearnChunker.train(itertools.islice(reader, 5000), feature_detector=ner_features, all_classes=all_classes)
I get this error:
nextword, nextpos = tokens[index + 1]
IndexError: list index out of range
My first question is: is the error really raised inside ner_features, and if so, why? I checked the data and everything looks fine, and I also checked that to_dataset and the other functions are processing the corpus data correctly. I don't know where else to look; the line `nextword, nextpos = tokens[index + 1]` itself looks perfectly fine. I must be missing something trivial yet important, and I just don't know where or what. It's probably down to my lack of knowledge about loading data in sklearn.
Here, as an example, is one of the feature dicts that gets passed on from ner_features:
[{'word': 'Thousands', 'lemma': 'thousand', 'pos': 'NNS', 'shape': 'capitalized', 'next-word': 'of', 'next-pos': 'IN', 'next-lemma': 'of', 'next-shape': 'lowercase', 'next-next-word': 'demonstrators', 'next-next-pos': 'NNS', 'next-next-lemma': 'demonstr', 'next-next-shape': 'lowercase', 'prev-word': '__START1__', 'prev-pos': '__START1__', 'prev-lemma': '__start1__', 'prev-iob': '__START1__', 'prev-shape': 'wildcard', 'prev-prev-word': '__START2__', 'prev-prev-pos': '__START2__', 'prev-prev-lemma': '__start2__', 'prev-prev-shape': 'wildcard'}]
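Regarding the numpy-array worry at the top: converting such dicts into a numeric array is exactly what the DictVectorizer inside train does. A standalone sketch of that step, with made-up toy features:

from sklearn.feature_extraction import DictVectorizer

feature_dicts = [
    {'word': 'Thousands', 'pos': 'NNS', 'shape': 'capitalized'},
    {'word': 'of',        'pos': 'IN',  'shape': 'lowercase'},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feature_dicts)  # numpy array, one row per dict
print(X.shape)             # (2, 6): each distinct feature=value pair becomes a column
print(vec.feature_names_)  # ['pos=IN', 'pos=NNS', 'shape=capitalized', ...]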
Please help; I've been wrestling with this for days now and I don't know what else to try.
UPDATE: I managed to change the formatting of my corpus so that I can read it with a corpus reader. I think the problem lies in converting the data into an array. I no longer get the "IndexError: list index out of range", but now it seems my array is not being filled:

ValueError: Found array with 0 sample(s) (shape=(0, 315)) while a minimum of 1 is required.

Any ideas what I'm doing wrong here? If you need more insight into the code, let me know. Thanks!
Answer (score: 0):
Change

X.append(feature_detector(tagged, index, history=iob_tags[:index]))

to

X.append(feature_detector(list(zip(words, tags)), index, history=iob_tags[:index]))

and remove the now-obsolete line

tagged = zip(words, tags)

I ran into the same problem, and I think the code behaves differently under Python 2.7 and 3.5: in Python 2, zip() returns a list, but in Python 3 it returns a one-shot iterator. The original tagged = zip(words, tags) is therefore exhausted after the first call to ner_features, so every later call sees an empty token list (only the four padding entries remain), and tokens[index + 1] runs past the end.
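A quick demonstration of the difference (a minimal sketch):

words = ('Where', 'the', 'disposer')
tags  = ('WRB', 'DT', 'NN')

tagged = zip(words, tags)
print(list(tagged))  # Python 3: [('Where', 'WRB'), ('the', 'DT'), ('disposer', 'NN')]
print(list(tagged))  # Python 3: [] -- the iterator is already exhausted
                     # (in Python 2, zip returned a list, so both prints were identical)

Re-creating the pairs with list(zip(words, tags)) inside the loop, as in the fix above, gives every call to ner_features its own materialized copy of the sentence.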