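I have a dataset in which I use a text column to predict a numeric column.
My ultimate question is: which words in the text column are associated with higher or lower scores?
So my workflow is to vectorize the text column first and then fit a ridge regression. But once this pipeline is built, how do I extract the coefficients and map them to the vectorizer's feature names?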
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
# This is my toy data
d = {'text': ["I am a a string", "And I am a string", "I, too am string", "And me", "Me too"],
     'target': [3, 4, 14, 6, 7]}
df = pd.DataFrame(d)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'],
                                                    test_size=0.3, random_state=42)
# Here is a vectorizer
vect = TfidfVectorizer(stop_words='english')
X_train_vect = vect.fit_transform(X_train)
# Here is a ridge regressor
model = Ridge(random_state=42)
model.fit(X_train_vect, y_train)
# Now we make a pipeline
pipe = Pipeline([('vect',vect),('model',model)])
y_pred = pipe.predict(X_test)
How do I get from here to a mapping of words to coefficients?
For example: "I am": 0.05
or something similar.
Answer 0 (score: 0)
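idf = vect.idf_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+
print(dict(zip(vect.get_feature_names_out(), idf)))
This should do it.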
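One caveat worth checking: `idf_` holds the vectorizer's inverse-document-frequency weights, which are computed from the text alone, whereas the question asks about the regression coefficients. Here is a minimal sketch of the difference, assuming the toy sentences from the question, a default `TfidfVectorizer()` (no stop-word list, so more words survive) and a second, made-up target vector used purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

texts = ["I am a a string", "And I am a string", "I, too am string", "And me", "Me too"]

# The vectorizer never sees the target, so idf_ depends only on the text.
vect = TfidfVectorizer()
X = vect.fit_transform(texts)
idf = vect.idf_

# Ridge coefficients, by contrast, change when the target changes.
coef_a = Ridge().fit(X, [3, 4, 14, 6, 7]).coef_   # target from the question
coef_b = Ridge().fit(X, [7, 6, 14, 4, 3]).coef_   # hypothetical reordered target

print("idf weights:        ", idf)
print("coefs for target A: ", coef_a)
print("coefs for target B: ", coef_b)
```

Both arrays have one entry per vocabulary word, but only the coefficients say anything about how a word relates to the score.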
Answer 1 (score: 0)
You are not using the pipeline in the best way. You can call .fit() on the pipeline itself, as shown below.
# Here is a vectorizer
vect = TfidfVectorizer(stop_words='english')
# Here is a ridge regressor
model = Ridge(random_state=42)
# Now we make a pipeline
pipe = Pipeline([('vect',vect),('model',model)])
pipe.fit(X_train, y_train)
pipe.predict(X_test)
# array([8.07176068, 7.21966856])
Now, to see the coefficient that corresponds to each feature, use:
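# for sklearn >= 1.0
list(zip(pipe['vect'].get_feature_names_out(), pipe['model'].coef_))
# for 0.21.0 <= sklearn < 1.2 (get_feature_names() was removed in 1.2)
list(zip(pipe['vect'].get_feature_names(), pipe['model'].coef_))
# for sklearn < 0.21.0
list(zip(pipe.named_steps.vect.get_feature_names(), pipe.named_steps.model.coef_))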
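To answer the original question (which words go with higher or lower scores), it may help to put the word/coefficient pairs into a sorted pandas Series. The following is only a sketch under a few assumptions: it fits on all five toy rows instead of a train split, it drops `stop_words='english'` so that more than one word survives in such tiny data, and `feature_names` is a hypothetical helper that picks whichever feature-name method the installed scikit-learn provides.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

texts = ["I am a a string", "And I am a string", "I, too am string", "And me", "Me too"]
targets = [3, 4, 14, 6, 7]

pipe = Pipeline([('vect', TfidfVectorizer()), ('model', Ridge(random_state=42))])
pipe.fit(texts, targets)

def feature_names(vectorizer):
    # Hypothetical helper: prefer get_feature_names_out() (sklearn >= 1.0),
    # fall back to get_feature_names() on older versions.
    if hasattr(vectorizer, 'get_feature_names_out'):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

words = feature_names(pipe.named_steps['vect'])
coefs = pipe.named_steps['model'].coef_

# One coefficient per vocabulary word, sorted from most negative to most positive.
word_coefs = pd.Series(coefs, index=words).sort_values()
print(word_coefs)
```

Words at the top of the printout pull the prediction down and words at the bottom push it up. With TF-IDF features and an L2 penalty the magnitudes are shrunk, so treat them as a rough ranking and not as exact effect sizes.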