Question

I have the following dataset:

input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1]], columns=('feature1', 'Tag'))

I want to convert it to a TF-IDF matrix using the following function:

def TfifdMatrix(inputSet):
    vectorizer = CountVectorizer()
    vectorizer.fit_transform(inputSet)
    print("fit transform done")
    smatrix = vectorizer.transform(inputSet)
    print("transform done")
    smatrix = smatrix.todense()
    tfidf = TfidfTransformer(norm="l2")
    tfidf.fit(smatrix)
    tf_idf_matrix = tfidf.transform(smatrix)
    print("transformation done")
    TfidfMatrix = pd.DataFrame(tf_idf_matrix.todense())
    return (TfidfMatrix)

Now I transform the data and add the labels:

input_data2 = TfifdMatrix(input_data['feature1'])
input_data = pd.concat([input_data, input_data2], axis=1)

Create the training and test sets:

train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]

train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)

test_features2 = test['Tag']

Now I train the decision tree algorithm:

my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))

Now I combine everything to get an overview of the original features, the real outcome, the predicted outcome and the TF-IDF matrix:

df_final = pd.DataFrame(test_features, test_outcome)
df_final['Prediction'] = my_dt_prediction

However, this gives me the following data:

  feature1   0   1   2   3   4  Prediction
  1      NaN NaN NaN NaN NaN NaN           1

Any ideas about what went wrong?

Answer 1

Given that you are already using sklearn, I would use train_test_split to split the dataset.

from sklearn.model_selection import train_test_split
from sklearn import tree
import pandas as pd

Y = input_data['Tag']
X = input_data.drop('Tag', axis=1)

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=123)

# Train and predict (drop the raw text column, since the tree needs numeric features)
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(Xtrain.drop('feature1', axis=1), Ytrain)
my_dt_prediction = my_tree_one.predict(Xtest.drop('feature1', axis=1))

# Join it all
complete_df = pd.concat([Xtest, Ytest], axis=1)  # features and actual values
complete_df['Predicted'] = my_dt_prediction  # adds a predicted column, so complete_df now holds features, actual and predicted values

You can drop one line by creating the Predicted column and generating the predictions in a single statement:

complete_df['Predicted'] = my_tree_one.predict(Xtest.drop('feature1', axis=1))
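Put together as something that can be run end to end, the approach might look like the sketch below. It is only an illustration under a few assumptions: TfidfVectorizer is swapped in for the question's TfifdMatrix helper, the tfidf_0 ... column names are made up for readability, and the Actual column name is hypothetical.

```python
import pandas as pd
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

input_data = pd.DataFrame(
    [['This is the news', 0], ['This is the news', 0],
     ['This is not the news', 1], ['This is not the news', 1],
     ['This is not the news', 1], ['This is not the news', 1]],
    columns=('feature1', 'Tag'))

# Assumption: TfidfVectorizer replaces the question's TfifdMatrix helper.
vectorizer = TfidfVectorizer(norm="l2")
tfidf_values = vectorizer.fit_transform(input_data['feature1']).toarray()

# Hypothetical string column names; the index is kept aligned with input_data.
tfidf = pd.DataFrame(
    tfidf_values,
    index=input_data.index,
    columns=[f"tfidf_{i}" for i in range(tfidf_values.shape[1])])

# Split only the numeric features; the original text is looked up again later.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    tfidf, input_data['Tag'], test_size=0.2, random_state=123)

model = tree.DecisionTreeClassifier(random_state=0)
model.fit(Xtrain, Ytrain)

# Original text, TF-IDF features, actual and predicted values side by side.
complete_df = input_data.loc[Xtest.index, ['feature1']].join(Xtest)
complete_df['Actual'] = Ytest
complete_df['Predicted'] = model.predict(Xtest)
print(complete_df)
```

Because every piece keeps the original row index, the text column, the TF-IDF columns and the labels line up without any extra key column.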

- UPDATE -

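So in my comment I mentioned using a "key" column, but the solution is actually simpler than that.

Assuming your input_data contains the original word feature and the target variable, simply apply the TF-IDF transformation to input_data and then add the TF-IDF-transformed matrix to input_data.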

input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1]], columns=('feature1', 'Tag'))

def TfifdMatrix(inputSet):
    vectorizer = CountVectorizer()
    vectorizer.fit_transform(inputSet)
    print("fit transform done")
    smatrix = vectorizer.transform(inputSet)
    print("transform done")
    smatrix = smatrix.todense()
    tfidf = TfidfTransformer(norm="l2")
    tfidf.fit(smatrix)
    tf_idf_matrix = tfidf.transform(smatrix)
    print("transformation done")
    TfidfMatrix = pd.DataFrame(tf_idf_matrix.todense())
    return (TfidfMatrix)

input_data2 = TfifdMatrix(input_data['feature1'])

# Add your TF-IDF transformation matrix
input_data = pd.concat([input_data, input_data2], axis=1)

# Now do your usual train/test split
train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]
train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)

# Now train but make sure to drop your original word feature for both fit and predict
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))

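# Now combine: copy the test features, then attach the actual and predicted values.
# pd.DataFrame(test_features, test_outcome) would treat test_outcome as the index
# and reindex the rows, which is what produces the NaN values in the question.
df_final = test_features.copy()
df_final['Actual'] = test_outcome
df_final['Prediction'] = my_dt_prediction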

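You should get a data frame containing the original word feature, the TF-IDF-transformed features, the actual values and the predicted values.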
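If NaN values show up instead, as in the question, the likely cause is the call pd.DataFrame(test_features, test_outcome): the second positional argument of pd.DataFrame is index, and when the data is already a DataFrame, pandas reindexes it to those labels rather than attaching them as a column. The sketch below reproduces the effect with made-up values (the frame contents, the labels and the Actual column name are all hypothetical) and shows one way around it.

```python
import pandas as pd

# Hypothetical test rows whose index labels are 4 and 5.
features = pd.DataFrame({'feature1': ['some text', 'other text']}, index=[4, 5])
outcome = [0, 1]      # actual labels (made-up values)
prediction = [0, 1]   # predicted labels (made-up values)

# The labels are interpreted as the index: rows 0 and 1 are requested,
# they do not exist in `features`, so every cell becomes NaN.
broken = pd.DataFrame(features, outcome)
print(broken)

# Attaching the values as columns keeps the original rows intact.
fixed = features.copy()
fixed['Actual'] = outcome
fixed['Prediction'] = prediction
print(fixed)
```

Assigning a plain list or NumPy array to a new column works by position, so it does not depend on the index labels of the test set.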

Create a data frame with the predicted values, the actual values and the original features
