我正在努力将FastText(FTTransformer)实施到在不同矢量化程序上迭代的管道中。更具体地说,我无法获得交叉验证分数。使用以下代码:
%%time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
w2v_texts = [simple_preprocess(doc) for doc in X_train]
models = [FTTransformer(size=10, min_count=0, seed=42)]
classifiers = [LogisticRegression(random_state=0)]
for model in models:
for classifier in classifiers:
model.fit(w2v_texts)
classifier.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', classifier)
])
print(pipeline.score(X_train, y_train))
#print(model.gensim_model.wv.most_similar('kirk'))
cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
KeyError:“单词的所有ngram机器学习都可能有用 有时“没有品牌”
该问题如何解决?
旁注:我其他使用D2VTransformer
或TfIdfVectorizer
的管道也可以正常工作。在这里,我可以在定义管道之后简单地应用pipeline.fit(X_train, y_train)
,而不是上面的两个拟合。似乎FTTransformer与其他给定的矢量化器集成得不是很好吗?
答案 0 :(得分:1)
是的,要在管道中使用,FTTransformer
需要进行修改,以将文档拆分为其fit
方法中的单词。一个人可以做到如下:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
class FTTransformer2(FTTransformer):
def fit(self, x, y):
super().fit([simple_preprocess(doc) for doc in x])
return self
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
classifiers = [LogisticRegression(random_state=0)]
for classifier in classifiers:
pipeline = Pipeline([
('ftt', FTTransformer2(size=10, min_count=0, seed=0)),
('clf', classifier)
])
score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print(score)