Question

我将Spyder用作Anaconda的一部分，并尝试按事件类型对推文（文本）进行分类。为此，我使用包 cross_val_score ，已经使用 TfidVectorizer 将我的推文矢量化，然后使用 fit_transform 转换我的训练数据的字母组合，二元组和三元组，如下所示：

# TF-IDF on unigrams, bigrams and trigrams
tfidf_words = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
                              ngram_range=(1,1), stop_words='english')

# vectorize for bigrams
tfidf_bigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
                              ngram_range=(2,2), stop_words='english')

# vecorize for trigrams
tfidf_trigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
                              ngram_range=(3,3), stop_words='english')

# Transform and fit each of the outputs from TF-IDF (unigrams, bigrams and trigrams)
x_train_words = tfidf_words.fit_transform(x_train_sm.preprocessed).toarray()

# bigrams
x_train_bigrams = tfidf_bigrams.fit_transform(x_train_sm.preprocessed).toarray()

#trigrams
x_train_trigrams = tfidf_trigrams.fit_transform(x_train_sm.preprocessed).toarray()

现在，我使用包 cross_val_score 进行交叉验证，以计算单字组，双字组和三字组的平均准确度。完成后，我将尝试为实现的精度制作并保存一个箱形图。这已针对4种不同的型号完成：

# Create list of models to be tested: Random Forest, Linear SVC, Naive Bayes & Logistic Regression

models = [OneVsRestClassifier(RandomForestClassifier(n_estimators = 200, max_depth=3, random_state=0)), 
          OneVsRestClassifier(LinearSVC()), OneVsRestClassifier(MultinomialNB()), 
                              OneVsRestClassifier(LogisticRegression(random_state=0))]

# number of folds (10-fold cross validation performed for each model)
CV = 10

########## Fitting, predicting and calculating average accuracy for unigrams data ##########

# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_words = pd.DataFrame(index=range(CV * len(models)))

#create an empty list, which will be populated with the accuracies of each model at each fold
entries = []

# list of the names of the models tested
names = ["Random Forest", "Linear SVC", "Naive Bayes", "Logistic Regression"]

# convert y_train_sm from an array into a series to work in the 'cross_val_score' function
# this series contains all of the event_ids for the corresponding encoded tweets (labels)
# cross_val_score is a functin used to calculate performance scores and implement cross-validation
y_train_sm = pd.Series(y_train_sm.tolist())

### Fitting, predicting and calculating average accuracy for unigrams data ###

# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_words' with the fold and accuracy scores at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_words_sm => x training data after oversampling (unigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv = number of folds you want to be performed with cross-validation
    accuracies = cross_val_score(model, x_train_words, y_train_sm, scoring ='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
        cv_words = pd.DataFrame(entries, columns=['model_name_unigrams', 'fold_idx', 'accuracy'])
    i = i + 1

# plot the results of each model on a single box plot    
box_words = sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_words)
fig_words = box_words.get_figure()
fig_words.savefig('boxplot_unigrams.png')

字母组合的输出正是我想要的：

现在，当我运行二元组和三元组的代码（突出显示所有代码并点击“播放”）时，我得到以下信息：

字母组合：

[ Boxplot for bigrams[2]

三字组：

每一个的代码都是相同的，除了它们使用“ cv_bigrams”和“ cv_trigrams”作为箱形图的数据输入。每个代码都在下面。

字母代码：

# create blank dataframe with an index equal to the number of CV folds * number of models tested
cv_bigrams = pd.DataFrame(index=range(CV * len(models)))

# clear the previous list called 'entries' that was populated with values
entries = []

# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_bigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_bigrams_sm => x training data after oversampling (bigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv = number of folds you want to performed with cross-validation
    accuracies = cross_val_score(model, x_train_bigrams, y_train_sm, scoring ='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
        cv_bigrams = pd.DataFrame(entries, columns=['model_name_bigrams', 'fold_idx', 'accuracy'])
    i = i + 1

Trigrams代码：

# create blank dataframe with an index equal to number of CV folds * number of models tested
cv_trigrams = pd.DataFrame(index=range(CV * len(models)))

# clear the previous list called 'entries' that was populated with values
entries = []

# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_trigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
    #model_name = #model.__class__.__name__
    model_name = names[i]
    # model => the model that will be used to fit the data
    # x_train_trigrams => data that is to be fitted by the selected model (trigrams)
    # y_train_sm => y training data after oversampling (event_id)
    # scoring => the type of score you want the function 'cross_val_score' to return
    # cv = number of folds you want to performed with cross-validation
    accuracies = cross_val_score(model, x_train_trigrams, y_train_sm, scoring ='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
        cv_trigrams = pd.DataFrame(entries, columns=['model_name_trigrams', 'fold_idx', 'accuracy'])
    i = i + 1

如果仅选择以下代码并运行，则会发生以下情况：

# plot the results of each model as a box plot    
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
fig_bigrams = box_bigrams.get_figure()
fig_bigrams.savefig('boxplot_bigrams.png')

对于三连词相同：

# plot the results of each model as a box plot    
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
fig_trigrams = box_trigrams.get_figure()
fig_trigrams.savefig('boxplot_trigrams.png')

输出：

有什么主意，为什么我一次运行所有代码时会得到重复的箱形图重叠（当我将这些代码投入生产时需要这样做），而不是突出显示片段并单独运行？

Answer 1

呼应@ImportanceOfBeingErnest的注释，您的代码太复杂，您的问题还不够清楚。您是否要创建3个不同的数字，每个案例一个（unigram，bigram和trigram）？您是否要制作一个具有3个轴的图形（在matplotlib中称为子图）？您是否正在尝试将三个案例并排放置一个图形？

对我来说，最简单的方法是创建一个带有3个子图的图形，如下所示：

fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(xx,yy))  # choose appropriate size to fit your needs
sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_unigrams, ax=ax1)
sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams, ax=ax2)
sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams, ax=ax3)
fig.savefig('your_figure_name_here.png')

请参阅subplots demo here和有关plt.subplots()或fig.add_subplot()的文档。在the documentation for seaborn.boxplot()中，您将看到它是一个“轴级”功能，这意味着您可以要求它在您选择的任何轴对象上绘图

使用Seaborn在Python中绘制箱形图-为二元组和三元组创建重复项

1 个答案: