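I'm using Spyder as part of Anaconda and am trying to classify tweets (text) by event type. To do this I use the scikit-learn function cross_val_score. I have already vectorized my tweets with TfidfVectorizer, and then used fit_transform to turn my training data into unigrams, bigrams and trigrams, as shown below: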
# TF-IDF on unigrams, bigrams and trigrams
tfidf_words = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(1,1), stop_words='english')
# vectorize for bigrams
tfidf_bigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(2,2), stop_words='english')
# vectorize for trigrams
tfidf_trigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(3,3), stop_words='english')
# Transform and fit each of the outputs from TF-IDF (unigrams, bigrams and trigrams)
x_train_words = tfidf_words.fit_transform(x_train_sm.preprocessed).toarray()
# bigrams
x_train_bigrams = tfidf_bigrams.fit_transform(x_train_sm.preprocessed).toarray()
#trigrams
x_train_trigrams = tfidf_trigrams.fit_transform(x_train_sm.preprocessed).toarray()
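To make the three feature sets concrete, here is a minimal sketch of what the vectorizers above produce. It assumes a made-up three-tweet corpus (the `tweets` list stands in for `x_train_sm.preprocessed` and is not my real data) and keeps only the `ngram_range`, `sublinear_tf` and `norm` settings from the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for x_train_sm.preprocessed
tweets = [
    "flood warning issued for the river valley",
    "earthquake shakes the city centre overnight",
    "flood water rising near the city centre",
]

shapes = {}
for name, n in [("unigrams", 1), ("bigrams", 2), ("trigrams", 3)]:
    # ngram_range=(n, n) keeps only n-grams of exactly length n
    vec = TfidfVectorizer(sublinear_tf=True, norm="l2", ngram_range=(n, n))
    X = vec.fit_transform(tweets)  # sparse matrix, one row per tweet
    shapes[name] = X.shape
    # vocabulary_ maps each n-gram to its column index
    print(name, X.shape, sorted(vec.vocabulary_)[:3])
```

Each matrix has one row per tweet; only the number of columns (distinct n-grams) differs between the three cases.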
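Now I use cross_val_score to run cross-validation and compute the mean accuracy for the unigrams, bigrams and trigrams. Once that is done, I try to draw and save a box plot of the resulting accuracies. This is done for 4 different models: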
# Create list of models to be tested: Random Forest, Linear SVC, Naive Bayes & Logistic Regression
models = [OneVsRestClassifier(RandomForestClassifier(n_estimators = 200, max_depth=3, random_state=0)),
OneVsRestClassifier(LinearSVC()), OneVsRestClassifier(MultinomialNB()),
OneVsRestClassifier(LogisticRegression(random_state=0))]
# number of folds (10-fold cross validation performed for each model)
CV = 10
########## Fitting, predicting and calculating average accuracy for unigrams data ##########
# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_words = pd.DataFrame(index=range(CV * len(models)))
#create an empty list, which will be populated with the accuracies of each model at each fold
entries = []
# list of the names of the models tested
names = ["Random Forest", "Linear SVC", "Naive Bayes", "Logistic Regression"]
# convert y_train_sm from an array into a series to work in the 'cross_val_score' function
# this series contains all of the event_ids for the corresponding encoded tweets (labels)
# cross_val_score is a function used to calculate performance scores and implement cross-validation
y_train_sm = pd.Series(y_train_sm.tolist())
### Fitting, predicting and calculating average accuracy for unigrams data ###
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_words' with the fold and accuracy scores at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_words => x training data after oversampling (unigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_words, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_words = pd.DataFrame(entries, columns=['model_name_unigrams', 'fold_idx', 'accuracy'])
    i = i + 1
# plot the results of each model on a single box plot
box_words = sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_words)
fig_words = box_words.get_figure()
fig_words.savefig('boxplot_unigrams.png')
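The cross-validation loop above could be condensed into a small helper along these lines. This is only a sketch: `collect_cv_scores` is a name made up for illustration, the data comes from `make_classification` rather than my tweets, and only two of the four models are used to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def collect_cv_scores(models, X, y, cv, name_col):
    """Return one row per (model, fold) holding that fold's accuracy."""
    entries = []
    for model_name, model in models.items():
        accuracies = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
        for fold_idx, accuracy in enumerate(accuracies):
            entries.append((model_name, fold_idx, accuracy))
    return pd.DataFrame(entries, columns=[name_col, "fold_idx", "accuracy"])


# Synthetic stand-in data (assumption): 200 samples, 3 classes
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
models = {
    "Linear SVC": LinearSVC(),
    "Logistic Regression": LogisticRegression(random_state=0, max_iter=1000),
}
cv_demo = collect_cv_scores(models, X, y, cv=5, name_col="model_name_demo")
print(cv_demo.groupby("model_name_demo")["accuracy"].mean())
```

Keying the models by name in a dict removes the need for the separate `names` list and the manual counter `i`.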
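The output for the unigrams is exactly what I want:
Now, when I run the code for the bigrams and trigrams (highlighting all of the code and hitting "Run"), I get the following:
Bigrams:
Trigrams:
The code for each is identical, except that they use "cv_bigrams" and "cv_trigrams" as the data input for the box plot. The code for each is below.
Bigrams code: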
# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_bigrams = pd.DataFrame(index=range(CV * len(models)))
# clear the previous list called 'entries' that was populated with values
entries = []
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_bigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_bigrams => x training data after oversampling (bigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_bigrams, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_bigrams = pd.DataFrame(entries, columns=['model_name_bigrams', 'fold_idx', 'accuracy'])
    i = i + 1
Trigrams code:
# create blank dataframe with an index equal to number of CV folds * number of models tested
cv_trigrams = pd.DataFrame(index=range(CV * len(models)))
# clear the previous list called 'entries' that was populated with values
entries = []
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_trigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_trigrams => data that is to be fitted by the selected model (trigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv => number of folds to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_trigrams, y_train_sm, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
    cv_trigrams = pd.DataFrame(entries, columns=['model_name_trigrams', 'fold_idx', 'accuracy'])
    i = i + 1
The following is what happens if I select only the code below and run it on its own:
# plot the results of each model as a box plot
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
fig_bigrams = box_bigrams.get_figure()
fig_bigrams.savefig('boxplot_bigrams.png')
The same for the trigrams:
# plot the results of each model as a box plot
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
fig_trigrams = box_trigrams.get_figure()
fig_trigrams.savefig('boxplot_trigrams.png')
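Output:
Any idea why I get duplicated, overlapping box plots when I run all of the code at once (which I will need to do when I put this code into production), but not when I highlight the snippets and run them individually?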
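Answer 0: (score: 0)
Echoing @ImportanceOfBeingErnest's comment, your code is too complicated and your question is not clear enough. Are you trying to create 3 different figures, one per case (unigram, bigram and trigram)? Are you trying to make one figure with 3 axes (called subplots in matplotlib)? Are you trying to put the three cases side by side in a single plot?
To me, the simplest approach would be to create one figure with 3 subplots, like so: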
fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(xx,yy)) # choose appropriate size to fit your needs
# the unigram results are stored in 'cv_words' in the question
sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_words, ax=ax1)
sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams, ax=ax2)
sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams, ax=ax3)
fig.savefig('your_figure_name_here.png')
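See the subplots demo here and the documentation for plt.subplots() or fig.add_subplot(). In the documentation for seaborn.boxplot() you will see that it is an "axes-level" function, which means you can ask it to draw on whichever Axes object you choose.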
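If what you want is instead one separate figure per case, a minimal sketch could look like the following. When `ax` is not given, `sns.boxplot` draws onto the currently active axes, which is a likely explanation for the overlapping plots when the whole script runs in one go. Here each plot gets its own new figure and Axes. The accuracies are random placeholders and `save_boxplot` is a hypothetical helper name, not something from your code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def save_boxplot(df, name_col, filename):
    """Draw one box plot on its own new figure and save it to disk."""
    fig, ax = plt.subplots(figsize=(8, 4))
    sns.boxplot(x=name_col, y="accuracy", data=df, ax=ax)
    fig.savefig(filename)
    plt.close(fig)  # release the figure so later plots start clean
    return ax


rng = np.random.default_rng(0)
axes = []
for case in ["unigrams", "bigrams", "trigrams"]:
    name_col = f"model_name_{case}"
    # Placeholder scores (assumption): 2 models x 10 folds
    df = pd.DataFrame({
        name_col: ["Linear SVC"] * 10 + ["Naive Bayes"] * 10,
        "accuracy": rng.uniform(0.6, 0.9, size=20),
    })
    axes.append(save_boxplot(df, name_col, f"boxplot_{case}.png"))
```

Because every call creates and then closes its own figure, the result no longer depends on whether the script is run all at once or one snippet at a time.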