How to use a genetic algorithm (GA) for feature selection on an SMS spam/ham dataset

Asked: 2019-05-29 07:59:23

Tags: machine-learning scikit-learn spam-prevention

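I am working on a school ML project to detect spam SMS messages. I already have my dataset; my problem now is selecting features from it with a genetic algorithm. The project instructions require this approach, and I really don't know how to implement it. Any help would be appreciated.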

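import string
from nltk.corpus import stopwords

def textPreProcess(text):
    # Strip punctuation, then drop English stop words (case-insensitive)
    punctuationsToNone = str.maketrans('', '', string.punctuation)
    text = text.translate(punctuationsToNone)
    # Build the stop-word set once per call instead of once per word
    stopWords = set(stopwords.words("english"))
    text = [word for word in text.split() if word.lower() not in stopWords]
    return " ".join(text)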

Feature extraction using a genetic algorithm

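from nltk.stem import SnowballStemmer
import nltk
nltk.download("stopwords")

def stemming(text):
    # Reduce every word to its stem with the English Snowball stemmer
    words = map(lambda t: SnowballStemmer("english").stem(t), text.split())
    return " ".join(words)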

# Copy the message column so the original data is left untouched
texts = sms['message'].copy()

# And apply the pre-processing methods to the new DataFrame
texts = texts.apply(textPreProcess)
texts = texts.apply(stemming)

from sklearn.feature_extraction.text import TfidfVectorizer

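# Pass stop_words by keyword; the first positional parameter of TfidfVectorizer is `input`
vectorizer = TfidfVectorizer(stop_words="english")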
features = vectorizer.fit_transform(texts)

This is as far as I have got toward feature selection, but I think I am doing something wrong.
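For reference, GA-based feature selection is commonly framed as a search over bit masks, with one bit per TF-IDF column. The block below is a minimal sketch of that idea, not a definitive implementation. The names `genetic_feature_selection` and `toy_fitness`, and all parameter values, are hypothetical and chosen for illustration. It uses only the standard library and a made-up fitness function so that it runs on its own.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Search for a 0/1 mask (one flag per feature) that maximises `fitness`.

    Hypothetical helper for illustration; requires n_features >= 2.
    Returns (best_mask, best_score).
    """
    rng = random.Random(seed)
    # Initial population: random bit masks
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")

    for generation in range(generations + 1):
        scores = [fitness(mask) for mask in population]
        # Remember the best mask seen so far
        for mask, score in zip(population, scores):
            if score > best_score:
                best_mask, best_score = mask[:], score
        if generation == generations:
            break  # final population evaluated, no further breeding

        def tournament():
            # Binary tournament: pick two individuals, keep the fitter one
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        # Elitism: the best mask so far always survives
        next_population = [best_mask[:]]
        while len(next_population) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                # Single-point crossover
                point = rng.randint(1, n_features - 1)
                child = p1[:point] + p2[point:]
            else:
                child = p1[:]
            # Bit-flip mutation
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population

    return best_mask, best_score

# Toy fitness (an assumption for this demo): features 0, 3 and 5 are "useful",
# and every selected feature costs a small penalty.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    useful_selected = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return useful_selected - 0.1 * sum(mask)

mask, score = genetic_feature_selection(10, toy_fitness)
print(mask, score)
```

For the SMS data, one option would be to replace `toy_fitness` with a function that keeps only the columns of `features` whose mask bit is 1 and returns the mean of scikit-learn's `cross_val_score` for a classifier such as `MultinomialNB`. Because a TF-IDF vocabulary can contain thousands of terms, each fitness call is expensive, so small populations and few generations are a sensible starting point.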

0 answers:

No answers yet