Problem fitting text features in a model to get feature importance

Time: 2021-05-14 16:19:19

Tags: python machine-learning scikit-learn preprocessor feature-selection

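Unfortunately, I am still having trouble extracting and visualizing per-feature information from text features. To make this reproducible, I am providing some data (just a sample of what it looks like) and the code.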

Text                                                             Year #_of_characters_Subj
You won an amazing price!!!                                      2019  34
Dear John, I hope you are ready for this great news!!!!!!!:)     2020  67
It is awesome                                                    2012  56

Address                 Spam
abc@gmail.com             1
ghi@yahoo.com             0
yes_we_can@live.com       1

Code:

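import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# X: feature columns from the sample above; y: the Spam label (assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)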

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer = Pipeline(steps=[
    ('CV', CountVectorizer())
])

# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Text'),
        ('category', categorical_encoder, categorical_columns),
        ('numeric', numerical_pipe, numerical_columns)
])

# add the model as the final step of the pipeline
rf = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression())
])

rf.fit(X_train, y_train)

# Get feature names from the first ('text') transformer and the fitted coefficients
feature_names = rf['preprocess'].transformers_[0][1].get_feature_names()
coefs = rf.named_steps["classifier"].coef_.flatten()

zipped = zip(feature_names, coefs)
features_df = pd.DataFrame(zipped, columns=["feature", "value"])
features_df["ABS"] = features_df["value"].abs()
features_df["colors"] = features_df["value"].apply(lambda x: "green" if x > 0 else "red")
features_df = features_df.sort_values("ABS", ascending=False)

This last part of the code returns entries for the features, except that for the text I get something like this:

4201    x2_abc  0.120041    0.000241    green
1344    x0_You won an amazing price -0.000241   0.000241    red
2529    x0_Dear John, I hope you are ready for this great newss ... -0.000241   0.000241    red

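I would like to get the most important/relevant words rather than whole sentences. I suspect the problem comes from the way I fit the text features in the model, e.g. You won an amazing price. Do you have any insight into feature importance in this model and how to solve this?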
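
For reference, here is a minimal sketch of one way word-level coefficients could be obtained. It rests on several assumptions. The x0_ prefix in the output above is the naming scheme OneHotEncoder uses, which suggests the Text column may also have been passed to the one-hot encoder, so in this sketch Text goes only to CountVectorizer. The Domain column is hypothetical (a stand-in for a categorical feature derived from Address), CountVectorizer is used directly instead of inside a nested Pipeline, and max_iter=1000 is added only to help the toy fit converge. Feature names are rebuilt from fitted attributes (vocabulary_, categories_) in the same order the ColumnTransformer concatenates its outputs, so they line up with coef_.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data modeled on the sample rows; 'Domain' is a hypothetical column.
df = pd.DataFrame({
    "Text": [
        "You won an amazing price!!!",
        "Dear John, I hope you are ready for this great news!!!!!!!:)",
        "It is awesome",
    ],
    "Year": [2019, 2020, 2012],
    "Domain": ["gmail.com", "yahoo.com", "live.com"],
    "Spam": [1, 0, 1],
})

text_col, cat_cols, num_cols = "Text", ["Domain"], ["Year"]

preprocessor = ColumnTransformer([
    # A string selector (not a list) gives CountVectorizer the 1-D input it expects.
    ("text", CountVectorizer(), text_col),
    ("category", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("numeric", SimpleImputer(strategy="mean"), num_cols),
])
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(df[[text_col] + cat_cols + num_cols], df["Spam"])

pre = model.named_steps["preprocess"]

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = pre.named_transformers_["text"].vocabulary_
words = sorted(vocab, key=vocab.get)

# One name per one-hot column, in the encoder's category order.
encoder = pre.named_transformers_["category"]
cat_names = [
    f"{col}_{cat}"
    for col, cats in zip(cat_cols, encoder.categories_)
    for cat in cats
]

# ColumnTransformer stacks outputs in the order the transformers were listed.
feature_names = words + cat_names + num_cols
coefs = model.named_steps["classifier"].coef_.ravel()
assert len(feature_names) == len(coefs)

features_df = pd.DataFrame({"feature": feature_names, "value": coefs})
features_df["ABS"] = features_df["value"].abs()

# Rows for individual words come first; rank them by absolute coefficient.
word_df = features_df.iloc[:len(words)].sort_values("ABS", ascending=False)
print(word_df.head(10))
```

With only three rows the coefficients themselves are not meaningful; the point is that each row of word_df is a single token such as "amazing" rather than a whole sentence.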

0 answers:
