Question

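I have built a machine learning algorithm that classifies text into categories. I'm 99% done, but I don't know how to merge my predictions back into the original DataFrame so that I can print a view of what I started with alongside what was predicted.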

Here is my code.

#imports data from excel file and shows first 5 rows of data
file_name = r'C:\Users\aac1928\Documents\Machine Learning\Training        Data\RFP Training Data.xlsx'
sheet = 'Sheet1'

import pandas as pd
import numpy
import xlsxwriter
import sklearn

df = pd.read_excel(io=file_name,sheet_name=sheet)

# extracts specific columns (0 and 2) from the data
data = df.iloc[: , [0,2]]
print(data)

#Gets data ready for model
newdata = df.iloc[:,[1,2]]
newdata = newdata.rename(columns={'Label':'label'})
newdata = newdata.rename(columns={'RFP Question':'question'})
print(newdata)

# define X and y for use with CountVectorizer
X = newdata.question
y = newdata.label
print(X.shape)
print(y.shape)

# train on all rows; use the first 50 rows as the test set (these overlap with the training data)
X_train = X
y_train = y
X_test = newdata.question[:50]
y_test = newdata.label[:50]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
y_pred_class

# calculate predicted probabilities for X_test_dtm (column 1 corresponds to logreg.classes_[1])
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

# calculate accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

This is the new data I added to predict on, which should give predictions of the same length as the array:

# split X and y into training and testing sets
X_train = X
y_train = y
X_testnew = dfpred.question
y_testnew = dfpred.label
print(X_train.shape)
print(X_testnew.shape)
print(y_train.shape)
print(y_testnew.shape)

(447,)
(168,)
(447,)
(168,)

# transform new testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm_new = vect.transform(X_testnew)
X_test_dtm_new

<168x1382 sparse matrix with 2240 stored elements in Compressed Sparse Row format>

# make class predictions for the new X_test_dtm_new
y_pred_class_new = logreg.predict(X_test_dtm_new)
y_pred_class_new

array([3, 3, 19, 18, 5, 10, 10, 5, 19, 3, 3, 3, 5, 3, 3, 3, 3,
       9, 19, 5, 5, 10, 9, 5, 18, 19, 9, 9, 19, 19, 18, 18, 18, 4,
       18, 3, 9, 18, 19, 19, 18, 19, 5, 19, 19, 3, 3, 18, 18, 5, 18,
       3, 4, 5, 6, 4, 5, 19, 19, 5, 5, 19, 19, 4, 5, 18, 5, 5,
       19, 5, 18, 5, 19, 18, 19, 5, 7, 5, 9, 9, 9, 9, 10, 9, 9,
       5, 5, 5, 5, 3, 18, 4, 9, 5, 3, 6, 9, 18, 7, 5, 9, 5,
       5, 19, 5, 5, 19, 5, 6, 5, 5, 6, 9, 21, 10, 9, 18, 9, 9,
       3, 18, 5, 6, 6, 18, 6, 3, 6, 5, 18, 6, 5, 18, 5, 6, 7, 7,
       5, 7, 19, 18, 6, 5, 5, 5, 5, 5, 19, 16, 5, 19, 5, 5, 5,
       5, 19, 5, 7, 19, 6, 7, 3, 18, 18, 18, 6, 19, 19, 7], dtype=int64)

# calculate predicted probabilities for X_test_dtm_new (column 1 corresponds to logreg.classes_[1])
y_pred_prob_new = logreg.predict_proba(X_test_dtm_new)[:, 1]
y_pred_prob_new

df['prediction'] = pd.Series(y_pred_class_new)

dfout = pd.merge(dfpred, df['prediction'].dropna().to_frame(), how='left', left_index=True, right_index=True)

print(dfout)

I hope this helps. I tried to keep it as clear as I could.

Answer 1

Since your predictions are just an array, I think you are better off simply using:

df['predictions'] = y_pred_class
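
A minimal sketch of why assigning the raw array can behave differently from wrapping it in `pd.Series` first, as the question does. The toy `dfpred` frame, its index, and the prediction values below are made up for illustration. Assigning a NumPy array matches rows by position, whereas assigning a Series aligns on index labels, so it can produce NaN when the DataFrame index is not 0..n-1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the new data; note the non-default index.
dfpred = pd.DataFrame(
    {"question": ["q1", "q2", "q3"]},
    index=[10, 11, 12],
)
# Hypothetical predictions, one per row of dfpred.
y_pred_class_new = np.array([3, 19, 5])

# Positional assignment: lengths must match, index labels are ignored.
by_position = dfpred.copy()
by_position["prediction"] = y_pred_class_new

# Series assignment aligns on index labels (0, 1, 2 here), so no rows match.
by_label = dfpred.copy()
by_label["prediction"] = pd.Series(y_pred_class_new)

print(by_position)
print(by_label)
```

The array assignment only works when the array has exactly as many elements as the DataFrame has rows, which is why the prediction array has to be paired with the frame it was predicted from.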

Answer 2

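I think your problem is that your prediction array is shorter than the original df, because you split the data into training and test sets.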

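You define X_test as newdata.question[:50], which takes the first 50 rows of that column.

What I would do is create a prediction_df with the same length as your prediction array. In your case, the rows you need are the first 50 rows of the original df.

prediction_df = df.iloc[:50].copy()
prediction_df['predictions'] = y_pred_class

Just make sure the rows of prediction_df match the rows you used to build X_test!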
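
The approach in this answer can be sketched end to end as follows. This is an illustration under assumptions: the five-row `df`, the three-element `y_pred_class`, and the final `join` back onto the full frame are hypothetical additions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical original data with 5 rows.
df = pd.DataFrame({
    "question": ["a", "b", "c", "d", "e"],
    "label": [3, 5, 3, 19, 5],
})
# Hypothetical predictions for the first 3 rows only
# (analogous to predicting on newdata.question[:3]).
y_pred_class = np.array([3, 5, 19])

# Take the same rows that were used to build X_test; copy so the
# assignment below does not act on a slice of df.
prediction_df = df.iloc[:3].copy()
prediction_df["predictions"] = y_pred_class

# Left-join on the index so every original row is kept; rows without
# a prediction get NaN.
dfout = df.join(prediction_df[["predictions"]], how="left")
print(dfout)
```

Joining on the index keeps each prediction next to the row it was made from, so the original label and the predicted one can be compared side by side.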

Merging prediction results back into the original DataFrame?
