How to reduce the dimensionality of a bag of words in a random forest classification model

Time: 2017-12-05 07:04:42

Tags: pandas scikit-learn nltk random-forest supervised-learning

I am using text data as features along with other numeric features in a classification model.

How can I group similar words together in a supervised classification model? After counting the words, I want to group similar ones together and reduce the dimensionality of the bag of words.

My code:

# Cleaning the address data
import string
import nltk

stopwords = nltk.corpus.stopwords.words('english')
# The stopword filter has to operate on whole words (split), not on single characters
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: " ".join([item for item in x.split() if item not in stopwords]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if not item.isdigit()]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if item not in string.punctuation]))

# CountVectorizing the address data and merging the sparse matrix back into the dataframe
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

# Label encoding - converting categorical columns to numerical
from sklearn.preprocessing import LabelEncoder

data['Resi'] = LabelEncoder().fit_transform(data['Resi'])
data['Resi_Area'] = LabelEncoder().fit_transform(data['Resi_Area'])
data['Product'] = LabelEncoder().fit_transform(data['Product'])
data['Phone_Type'] = LabelEncoder().fit_transform(data['Phone_Type'])
data['Co_Name_FLag'] = LabelEncoder().fit_transform(data['Co_Name_FLag'])

# Classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler

X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.3, random_state=8)
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(X_train, y_train)
rf = RandomForestClassifier(n_estimators=1000, oob_score=True)

fit_rf = rf.fit(X_train_res, y_train_res)

Any help is appreciated.

1 Answer:

Answer 0 (score: 0)

If you want to reduce the dimensionality of the bag of words, you can use SelectPercentile from sklearn. Here is an example with the Iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
selector = SelectPercentile(score_func=chi2, percentile=50)
X_reduced = selector.fit_transform(X, y)

You can easily extend this to the words in your example:

cv = CountVectorizer(max_features=1000, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
selector = SelectPercentile(score_func=chi2, percentile=50)
X_reduced = selector.fit_transform(cv_addr, Y)
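
If you also need to know which words survive the selection (for example, to rebuild named dataframe columns as in the question), here is a minimal sketch assuming the cv, cv_addr and Y objects defined above:

import numpy as np

selector = SelectPercentile(score_func=chi2, percentile=50)
X_reduced = selector.fit_transform(cv_addr, Y)

# Boolean mask of the columns the selector kept, mapped back to the vocabulary
kept_words = np.array(cv.get_feature_names())[selector.get_support()]
print(X_reduced.shape, kept_words[:10])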

After that, you can run different trials to see which percentile works best, and finally plot the scores by percentile, also relating the high-scoring words to their term frequencies. Here is an example of such a bar chart:

[image: bar chart of high-scoring words and their term frequencies]
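
For the trials themselves, a minimal sketch assuming cv_addr and Y from the question; the percentile grid, the 3-fold cross-validation, and the top-20 cut-off are illustrative choices, not part of the original answer:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score

# Score each percentile with a small random forest to see which works best
for p in [10, 25, 50, 75, 100]:
    sel = SelectPercentile(score_func=chi2, percentile=p)
    X_p = sel.fit_transform(cv_addr, Y)
    acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=8),
                          X_p, Y, cv=3).mean()
    print(p, X_p.shape[1], round(acc, 4))

# Relate the highest-scoring words to their term frequencies (the data behind the bar chart)
sel = SelectPercentile(score_func=chi2, percentile=50).fit(cv_addr, Y)
words = np.array(cv.get_feature_names())
freqs = np.asarray(cv_addr.sum(axis=0)).ravel()
for w, s, f in sorted(zip(words, sel.scores_, freqs), key=lambda t: -t[1])[:20]:
    print(w, round(s, 2), int(f))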

Good luck.