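Unfortunately, I am still having trouble extracting and visualizing the individual features from the text feature. For reproducibility, I am providing some data (just a sample of what it looks like) and the code.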
Text                                                          Year  #_of_characters_Subj
You won an amazing price!!!                                   2019  34
Dear John, I hope you are ready for this great news!!!!!!!:)  2020  67
It is awesome                                                 2012  56

Address              Spam
abc@gmail.com        1
ghi@yahoo.com        0
yes_we_can@live.com  1
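To make the sample easier to reproduce, here is a minimal sketch that rebuilds it as a pandas DataFrame. It assumes the two blocks above are columns of one row-aligned table and that Spam is the target; the name df and the split into X and y are my own illustration, not part of the original data.

```python
import pandas as pd

# Assumption: both blocks above belong to one DataFrame, aligned row by row.
df = pd.DataFrame({
    "Text": [
        "You won an amazing price!!!",
        "Dear John, I hope you are ready for this great news!!!!!!!:)",
        "It is awesome",
    ],
    "Year": [2019, 2020, 2012],
    "#_of_characters_Subj": [34, 67, 56],
    "Address": ["abc@gmail.com", "ghi@yahoo.com", "yes_we_can@live.com"],
    "Spam": [1, 0, 1],
})

# Assumption: "Spam" is the label, everything else is a feature.
X = df.drop(columns="Spam")
y = df["Spam"]
```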
Code:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# X, y, categorical_columns and numerical_columns are assumed to be defined beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer = Pipeline(steps=[
    ('CV', CountVectorizer())
])

# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Text'),
        ('category', categorical_encoder, categorical_columns),
        ('numeric', numerical_pipe, numerical_columns)
    ])

# add model to be part of pipeline
rf = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression())
])
rf.fit(X_train, y_train)

# Get feature names from the whole preprocessor so they line up with coef_
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = rf['preprocess'].get_feature_names_out()
coefs = rf.named_steps["classifier"].coef_.flatten()
features_df = pd.DataFrame({"feature": feature_names, "value": coefs})
features_df["ABS"] = features_df["value"].abs()
features_df["colors"] = features_df["value"].apply(lambda x: "green" if x > 0 else "red")
features_df = features_df.sort_values("ABS", ascending=False)
This last part of the code returns the individual features, except for the text, where I get something like the following:
      feature                                                      value      ABS       colors
4201  x2_abc                                                       0.120041   0.000241  green
1344  x0_You won an amazing price                                  -0.000241  0.000241  red
2529  x0_Dear John, I hope you are ready for this great newss ...  -0.000241  0.000241  red
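I would like to get the most important/relevant words, not whole sentences. I suspect the problem comes from the way I fit the text feature in the model, i.e. as complete strings such as "You won an amazing price".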
Do you have any insight into feature importance in this model and how to fix these problems?