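I am trying to combine the dict-style features used in NLTK with scikit-learn tf-idf features for each instance.
Sample input: instances = [["I am working with text data"], ["This is my second sentence"]] instance = "I am working with text data"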
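def generate_features(self, instance):
    featureset = {}
    featureset["suffix"] = tokenize(instance)[-1]
    # transform() expects an iterable of documents and returns a sparse matrix
    featureset["tfidf"] = self.tfidf.transform([instance])
    return featureset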
from sklearn.linear_model import LogisticRegressionCV
from nltk.classify.scikitlearn import SklearnClassifier
self.classifier = SklearnClassifier(LogisticRegressionCV())
self.classifier.train(feature_sets)
The tf-idf vectorizer is fitted on all instances. But when I train the NLTK classifier with this feature set, it throws the following error.
self.classifier.train(feature_sets)
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 174, in _transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
I understand the problem here: it cannot vectorize a feature that has already been vectorized. But is there a way to work around this?
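The failure can be reproduced without NLTK, since the traceback shows the error coming from scikit-learn's DictVectorizer. The following is a minimal sketch, assuming a recent scikit-learn; the two sample documents and the variable names are made up for illustration, and the exact error message may differ between versions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents, not taken from the real data set.
docs = ["I am working with text data", "This is my second sentence"]
tfidf = TfidfVectorizer().fit(docs)

# A feature dict whose "tfidf" value is a sparse matrix rather than a number.
featureset = {"suffix": "data", "tfidf": tfidf.transform([docs[0]])}

error_type = None
try:
    DictVectorizer().fit_transform([featureset])
except TypeError as exc:
    error_type = type(exc).__name__
    print("DictVectorizer rejected the feature dict:", exc)
```

Removing the "tfidf" key from the dict makes the call succeed, which points at the sparse-matrix value as the cause.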
Answer 0 (score: 0)
For anyone who visits this question in the future, here is what I did to solve the problem.
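from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from scipy.sparse import hstack

def generate_features(instance):
    featureset = {}
    featureset["suffix"] = tokenize(instance)[-1]
    return featureset

# labels is assumed to hold one class label per instance
feature_sets = [(generate_features(instance), label) for instance, label in zip(instances, labels)]
# self.vec is assumed to be a DictVectorizer; its sparse output can go straight into hstack
X = self.vec.fit_transform([item[0] for item in feature_sets])
Y = [item[1] for item in feature_sets]
# fit_transform must be called on a TfidfVectorizer instance, not on the class;
# instances must be an iterable of strings here
tfidf = TfidfVectorizer().fit_transform(instances)
X = hstack((X, tfidf)).tocsr()
classifier = LogisticRegressionCV()
classifier.fit(X, Y)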
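For reference, here is a self-contained sketch of the same idea that also covers prediction on new text. The toy documents, labels, and variable names are invented for illustration, a whitespace split stands in for tokenize(), and plain LogisticRegression is swapped in for LogisticRegressionCV because the toy data set is too small for cross-validation.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
train_docs = [
    "I am working with text data",
    "This is my second sentence",
    "Text data needs cleaning",
    "Here is another sentence",
]
train_labels = ["data", "sentence", "data", "sentence"]


def generate_features(doc):
    # Hand-crafted dict features; a whitespace split stands in for tokenize().
    return {"suffix": doc.split()[-1].lower()}


dict_vec = DictVectorizer()
tfidf_vec = TfidfVectorizer()

# Fit both vectorizers on the training documents and join the columns.
X_train = hstack([
    dict_vec.fit_transform([generate_features(d) for d in train_docs]),
    tfidf_vec.fit_transform(train_docs),
]).tocsr()

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

# At prediction time call transform(), not fit_transform(), so the columns
# line up with the ones seen during training.
new_docs = ["More text data"]
X_new = hstack([
    dict_vec.transform([generate_features(d) for d in new_docs]),
    tfidf_vec.transform(new_docs),
]).tocsr()
prediction = classifier.predict(X_new)
print(prediction)
```

Keeping both fitted vectorizers around matters: refitting either one on new text would change the column layout and the classifier's coefficients would no longer match.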
Answer 1 (score: 0)
I don't know whether this helps. In my case, the value of featureset["suffix"] had to be a string or a number. For example:
featureset["suffix"] = "some value"
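Following the same rule, one possible way to keep tf-idf inside the feature dict is to expand the sparse row into one numeric entry per term. This is only a sketch under assumptions: the documents, the "tfidf_" key prefix, and the whitespace split used in place of tokenize() are all illustrative. Since the traceback shows NLTK's SklearnClassifier vectorizing with DictVectorizer, dicts built this way should be acceptable to it as well, although that is not tested here.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents, not taken from the real data set.
docs = ["I am working with text data", "This is my second sentence"]
tfidf_vec = TfidfVectorizer().fit(docs)
# Map column index -> term, built from the fitted vocabulary_ attribute.
index_to_term = {idx: term for term, idx in tfidf_vec.vocabulary_.items()}


def generate_features(doc):
    featureset = {"suffix": doc.split()[-1].lower()}  # string value
    row = tfidf_vec.transform([doc]).tocoo()
    # One plain-float entry per nonzero tf-idf weight.
    for col, weight in zip(row.col, row.data):
        featureset["tfidf_" + index_to_term[int(col)]] = float(weight)
    return featureset


feature_dicts = [generate_features(d) for d in docs]
dict_vec = DictVectorizer()
X = dict_vec.fit_transform(feature_dicts)
print(dict_vec.feature_names_)
```

Every value in the resulting dicts is either a string or a float, so DictVectorizer one-hot encodes the suffix and passes the tf-idf weights through unchanged.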