Part-of-speech tagging with NLTK and SVM

Time: 2016-03-31 04:10:00

Tags: python machine-learning nlp nltk svm

I am trying to write a part-of-speech (PoS) tagger in NLTK using a support vector machine (SVM). I can write a classifier like the one below:
>>> #CLASSIFICATION USING SUPPORT VECTOR MACHINE
>>> import nltk
>>> from nltk.classify import SklearnClassifier
>>> from sklearn.naive_bayes import BernoulliNB
>>> from nltk.tokenize import word_tokenize
>>> from sklearn.svm import SVC
>>> #TRAINING AND TEST DATA
>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
>>> test = [
        ('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg')]
>>> test_sentence = "This is the best band I've ever heard!"
>>> #FEATURESETS
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
>>> testf=[({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in test]
>>> test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
>>> #CLASSIFICATION
>>> #SUPPORT VECTOR MACHINE
>>> classif1 = SklearnClassifier(SVC(), sparse=False).train(t)
>>> classif1.classify(test_sent_features)
'neg'

I have tried using

from nltk.tag.sequential import ClassifierBasedPOSTagger

and also

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

but they did not help much. NLTK also appears to have its own SVM classifier module, imported like this:
from nltk.classify import svm 

but somehow I could not connect it to the tagger.
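
For context, the traceback further down shows `ClassifierBasedPOSTagger` calling `classifier_builder(classifier_corpus)`, so whatever is passed as `classifier_builder` has to be a callable. Below is a rough sketch of that contract, assuming the corpus is a list of `(feature_dict, tag)` pairs and that the returned object needs a `classify` method. `MajorityClassifier` and `majority_builder` are hypothetical names for illustration only, not part of NLTK.

```python
from collections import Counter


class MajorityClassifier:
    """Toy classifier: always predicts the most frequent training label."""

    def __init__(self, label):
        self._label = label

    def classify(self, featureset):
        # A real classifier would inspect featureset; this one ignores it.
        return self._label


def majority_builder(labeled_featuresets):
    """Hypothetical builder: takes [(feature_dict, label), ...], returns a classifier."""
    counts = Counter(label for _, label in labeled_featuresets)
    return MajorityClassifier(counts.most_common(1)[0][0])


# Tiny hand-made corpus; the tags are illustrative only.
corpus = [({"word": "the"}, "AT"), ({"word": "dog"}, "NN"), ({"word": "a"}, "AT")]
clf = majority_builder(corpus)
print(clf.classify({"word": "cat"}))  # prints "AT"
```

Any function with this shape, including a bound method such as `SklearnClassifier(SVC()).train`, should fit the same slot; a module object that has no `train` attribute does not.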

I tried them as follows and got the errors shown.

>>> import nltk
>>> from nltk.classify import SklearnClassifier
>>> from sklearn.naive_bayes import BernoulliNB
>>> from nltk.tokenize import word_tokenize
>>> from sklearn.svm import SVC
>>> from nltk.corpus import brown
>>> brown_trd=brown.tagged_sents()[:300]
>>> from nltk.tag.sequential import ClassifierBasedPOSTagger
>>> clf = SklearnClassifier(SVC(),sparse=False)
>>> svm_tagger = ClassifierBasedPOSTagger(train=brown_trd,
classifier_builder=clf.train)

Traceback (most recent call last):
  File "<pyshell#9>", line 2, in <module>
    classifier_builder=clf.train)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 630, in __init__
    self._train(train, classifier_builder, verbose)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 667, in _train
    self._classifier = classifier_builder(classifier_corpus)
  File "C:\Python27\lib\site-packages\nltk\classify\scikitlearn.py", line 115, in train
    X = self._vectorizer.fit_transform(X)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 226, in fit_transform
    return self._transform(X, fitting=True)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 174, in _transform
    values.append(dtype(v))
TypeError: float() argument must be a string or a number

>>> from nltk.classify import svm
>>> svm_tagger = ClassifierBasedPOSTagger(train=brown_trd,
classifier_builder=svm.train)

Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    classifier_builder=svm.train)
AttributeError: 'module' object has no attribute 'train'

>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> train_data=TfidfTransformer(brown_trd)
>>> train_data1=DictVectorizer(brown_trd)
>>> svm_tagger = ClassifierBasedPOSTagger(train=train_data,
classifier_builder=svm.train)

Traceback (most recent call last):
  File "<pyshell#16>", line 2, in <module>
    classifier_builder=svm.train)
AttributeError: 'module' object has no attribute 'train'

>>> svm_tagger = ClassifierBasedPOSTagger(train=train_data1,
classifier_builder=svm.train)

Traceback (most recent call last):
  File "<pyshell#17>", line 2, in <module>
    classifier_builder=svm.train)
AttributeError: 'module' object has no attribute 'train'
>>> svm_tagger = ClassifierBasedPOSTagger(train=train_data,
classifier_builder=clf.train)

Traceback (most recent call last):
  File "<pyshell#18>", line 2, in <module>
    classifier_builder=clf.train)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 630, in __init__
    self._train(train, classifier_builder, verbose)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 656, in _train
    for sentence in tagged_corpus:
TypeError: 'TfidfTransformer' object is not iterable
>>> svm_tagger = ClassifierBasedPOSTagger(train=train_data1,
classifier_builder=clf.train)

Traceback (most recent call last):
  File "<pyshell#19>", line 2, in <module>
    classifier_builder=clf.train)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 630, in __init__
    self._train(train, classifier_builder, verbose)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 656, in _train
    for sentence in tagged_corpus:
TypeError: 'DictVectorizer' object is not iterable
>>> svm_tagger=ClassifierBasedPOSTagger(train=train_data,classifier_builder
= lambda train_feats: clf.train(train_data))

Traceback (most recent call last):
  File "<pyshell#20>", line 2, in <module>
    = lambda train_feats: clf.train(train_data))
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 630, in __init__
    self._train(train, classifier_builder, verbose)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 656, in _train
    for sentence in tagged_corpus:
TypeError: 'TfidfTransformer' object is not iterable
>>> vm_tagger=ClassifierBasedPOSTagger(train=train_data1,classifier_builder
= lambda train_feats: clf.train(train_data1))

Traceback (most recent call last):
  File "<pyshell#21>", line 2, in <module>
    = lambda train_feats: clf.train(train_data1))
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 630, in __init__
    self._train(train, classifier_builder, verbose)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 656, in _train
    for sentence in tagged_corpus:
TypeError: 'DictVectorizer' object is not iterable
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(brown_trd)

Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    X_train_counts = count_vect.fit_transform(brown_trd)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 204, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
>>> X_train_counts = count_vect.fit_transform(train_data)

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    X_train_counts = count_vect.fit_transform(train_data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
    for doc in raw_documents:
TypeError: 'TfidfTransformer' object is not iterable
>>> X_train_counts = count_vect.fit_transform(train_data1)

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    X_train_counts = count_vect.fit_transform(train_data1)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
    for doc in raw_documents:
TypeError: 'DictVectorizer' object is not iterable
>>> X_new_counts = count_vect.transform(brown_trd)

Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    X_new_counts = count_vect.transform(brown_trd)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 863, in transform
    self._check_vocabulary()
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 275, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 678, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
>>> X_new_counts = count_vect.transform(train_data)

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    X_new_counts = count_vect.transform(train_data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 863, in transform
    self._check_vocabulary()
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 275, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 678, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
>>> X_new_counts = count_vect.transform(train_data1)

Traceback (most recent call last):
  File "<pyshell#29>", line 1, in <module>
    X_new_counts = count_vect.transform(train_data1)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 863, in transform
    self._check_vocabulary()
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 275, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 678, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
>>> 
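
One possible reading of the first traceback: it ends in `values.append(dtype(v))` with `float() argument must be a string or a number`, which suggests that some feature value reaching `DictVectorizer` is neither a string nor a number. A plausible culprit is `None` (for instance a "previous word" feature at the start of a sentence), though that is an assumption rather than something the transcript proves. A minimal sketch of one workaround wraps the builder so every non-numeric value becomes a string before training. `clean_features` and `wrap_builder` are hypothetical helpers, and the lambda is only a stand-in for a real training function.

```python
from numbers import Number


def clean_features(features):
    """Return a copy where every non-numeric value (None included) is a string."""
    cleaned = {}
    for name, value in features.items():
        if isinstance(value, Number):
            cleaned[name] = value
        else:
            cleaned[name] = str(value)
    return cleaned


def wrap_builder(train):
    """Wrap a training callable so it only ever sees cleaned feature dicts."""
    def builder(labeled_featuresets):
        return train([(clean_features(f), label) for f, label in labeled_featuresets])
    return builder


# Stand-in for something like SklearnClassifier(SVC()).train: it just records its input.
seen = []
builder = wrap_builder(lambda data: seen.extend(data) or "trained")
result = builder([({"word": "The", "prevword": None, "len": 3}, "AT")])
print(seen)
```

With NLTK and scikit-learn installed, the wrapper could presumably be passed as `classifier_builder=wrap_builder(clf.train)`. The features produced at tagging time would need the same cleanup, for example by overriding the tagger's `feature_detector` in a subclass.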

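I think I need to convert brown.tagged_sents() into a format that sklearn accepts, but I have not managed to do so, and the NLTK SVM module does not seem to be fully developed yet.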
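
A minimal sketch of such a conversion, under stated assumptions: each tagged sentence is flattened into one feature dict per token plus a parallel list of tags, `DictVectorizer` turns the dicts into a matrix, and `SVC(kernel="linear")` is trained on it. The feature set in `word_features`, the helper `to_sklearn_format`, and the toy corpus standing in for `brown.tagged_sents()[:300]` are all hypothetical choices made for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC


def word_features(words, i):
    """Hypothetical feature set for the word at position i (strings and numbers only)."""
    word = words[i]
    return {
        "word.lower": word.lower(),
        "suffix2": word[-2:].lower(),
        "prevword": words[i - 1].lower() if i > 0 else "<START>",
        "is_title": int(word.istitle()),
    }


def to_sklearn_format(tagged_sents):
    """Flatten [[(word, tag), ...], ...] into parallel lists of feature dicts and tags."""
    X, y = [], []
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            X.append(word_features(words, i))
            y.append(tag)
    return X, y


# Toy stand-in for brown.tagged_sents()[:300]; the tags are illustrative only.
tagged_sents = [
    [("The", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("The", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("A", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

X_dicts, y = to_sklearn_format(tagged_sents)
vectorizer = DictVectorizer()            # one-hot encodes string-valued features
X = vectorizer.fit_transform(X_dicts)    # sparse matrix, one row per token
model = SVC(kernel="linear").fit(X, y)

test_words = ["The", "dog", "sleeps"]
test_X = vectorizer.transform(
    [word_features(test_words, i) for i in range(len(test_words))]
)
print(list(zip(test_words, model.predict(test_X))))
```

This route bypasses `ClassifierBasedPOSTagger` entirely and tags each word from its own form and the preceding word, not from previously predicted tags, so it is a simpler model than the sequential tagger.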

I would appreciate any suggestions on how to solve this.

0 answers:

No answers yet