How do I extract the coefficients of TF-IDF features?

Time: 2019-07-03 11:55:45

Tags: scikit-learn nlp

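I have a dataset in which I use a text column to predict a numeric column.

My ultimate question is: which words in the text column are associated with higher or lower scores?

So my workflow is to vectorize the text column first and then fit a ridge regression. Once this pipeline is built, how do I extract the coefficients paired with the vectorizer's feature names?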

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# This is my toy data 
d = {'text': ["I am a a string", "And I am a string", "I, too am string", "And me", "Me too"], 
     'target': [3, 4, 14, 6, 7]}
df = pd.DataFrame(d)

X_train, X_test, y_train, y_test= train_test_split(df['text'], df['target'], 
                                                   test_size=0.3, random_state=42)


# Here is a vectorizer 
vect = TfidfVectorizer(stop_words='english')
X_train_vect = vect.fit_transform(X_train)

# Here is a ridge regressor
model = Ridge(random_state=42)
model.fit(X_train_vect, y_train)

# Now we make a pipeline
pipe = Pipeline([('vect',vect),('model',model)])
y_pred = pipe.predict(X_test)

How do I get the words paired with their coefficients from here? For example: "I am": 0.05, or something similar.

2 answers:

Answer 0 (score: 0)

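# idf_ holds the per-term IDF weights learned by the vectorizer
idf = vect.idf_
# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
print(dict(zip(vect.get_feature_names(), idf)))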

This should do it.
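
One thing worth checking here: `idf_` holds the vectorizer's inverse-document-frequency weights, which are computed from the text alone, whereas the question asks about the regression coefficients stored in `coef_`. Below is a minimal sketch of the difference. The `texts` and `targets` values are made-up toy data, not taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical toy data (an assumption for illustration, not from the question)
texts = [
    "excellent plot excellent acting",
    "excellent acting",
    "boring plot",
    "boring acting boring plot",
    "excellent plot",
    "boring",
]
targets = [9, 8, 3, 2, 8, 2]

vect = TfidfVectorizer(stop_words="english")
X = vect.fit_transform(texts)
model = Ridge(random_state=42).fit(X, targets)

# idf_ depends only on how many documents contain each term; it never sees the targets
print("idf :", vect.idf_)

# coef_ is fitted against the targets, so it can be negative and shows the direction of the association
print("coef:", model.coef_)
```

Both arrays have one entry per vocabulary term, so they can be zipped with the same feature names, but only `coef_` says anything about higher or lower scores.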

Answer 1 (score: 0)

You are not using the pipeline in the best way. You can call .fit() on the pipeline itself, as shown below:

# Here is a vectorizer 
vect = TfidfVectorizer(stop_words='english')

# Here is a ridge regressor
model = Ridge(random_state=42)

# Now we make a pipeline
pipe = Pipeline([('vect',vect),('model',model)])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

# array([8.07176068, 7.21966856])

Now, to see the coefficient corresponding to each feature, use:

# for sklearn >= 0.21.0
# (on sklearn >= 1.0, use get_feature_names_out() instead of get_feature_names())
list(zip(pipe['vect'].get_feature_names(), pipe['model'].coef_))

# for sklearn < 0.21.0
list(zip(pipe.named_steps.vect.get_feature_names(), pipe.named_steps.model.coef_))
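
To get at the original question (which words go with higher or lower scores), the pairs can be sorted by coefficient. Here is a rough sketch that also falls back between `get_feature_names_out()` and the older `get_feature_names()`, depending on which one the installed scikit-learn provides. The `texts` and `targets` values are hypothetical toy data chosen for illustration, and the variable name `ranked` is my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical toy data (an assumption for illustration, not from the question)
texts = [
    "excellent plot excellent acting",
    "excellent acting",
    "boring plot",
    "boring acting boring plot",
    "excellent plot",
    "boring",
]
targets = [9, 8, 3, 2, 8, 2]

pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("model", Ridge(random_state=42)),
])
pipe.fit(texts, targets)

vect = pipe.named_steps["vect"]
model = pipe.named_steps["model"]

# Newer scikit-learn releases expose get_feature_names_out(); older ones only have get_feature_names()
if hasattr(vect, "get_feature_names_out"):
    names = list(vect.get_feature_names_out())
else:
    names = list(vect.get_feature_names())

# Pair each term with its coefficient and sort from most positive to most negative
ranked = sorted(zip(names, model.coef_), key=lambda pair: pair[1], reverse=True)
for term, coef in ranked:
    print(f"{term}: {coef:.3f}")
```

Terms at the top of the printed list push the prediction up and terms at the bottom push it down. Keep in mind that the magnitudes depend on the TF-IDF scaling and on the ridge penalty, so they are best read as a ranking and not as exact effect sizes.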