I am trying to build a classification model using SGDClassifier().
My DataFrame has two columns: [full_text, label].
Below is my script:
df_scraped = pd.read_csv('data/labeled_tweets.csv')
df_public = pd.read_csv('data/public_data_labeled.csv')
df_scraped.drop_duplicates(inplace = True)
df_scraped.drop('id', axis = 'columns', inplace = True)
df_public.drop_duplicates(inplace = True)
df = pd.concat([df_scraped, df_public])
for index, row in df.iterrows():
    text = row['full_text']
    text = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
    df.at[index,'full_text'] = text
df['label'] = df.label.map({'Offensive': 1, 'Non-offensive': 0})
X_train, X_test, y_train, y_test = train_test_split(df['full_text'],
                                                    df['label'],
                                                    random_state=99)
print (X_train['full_text'].head(3))
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))
count_vector = CountVectorizer(stop_words = 'english', lowercase = True)
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)
# Dict for parameters
param_grid = {
    'alpha' : [0.095, 0.0002, 0.0003],
    'max_iter' : [2500, 3000, 4000]
}
print(X_train[0])
### label encode the categorical values and convert them to numbers
le = LabelEncoder()
le.fit(X_train[1].astype(str))
X_train[1] = le.transform(X_train[1].astype(str))
X_test[1] = le.transform(X_test[1].astype(str))
### train the model
clf_sgd = SGDClassifier()
clf_sgd.fit(X_train, y_train)
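When I run this script, I get the error KeyError: 'full_text'
The above exception was the direct cause of the following exception:
I don't understand why this happens. I am using the encoder to convert the strings to floats so that they can be used in the model.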
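For reference, here is a minimal sketch that seems to reproduce the same KeyError. It assumes the error comes from indexing the result of train_test_split with a column name; the toy DataFrame is made up for illustration and is not my real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data invented for illustration; the real data comes from the CSV files.
df = pd.DataFrame({
    "full_text": ["you are awful", "have a nice day", "terrible person", "good morning"],
    "label": [1, 0, 1, 0],
})

# Same call shape as in the script: a single column is passed for the features.
X_train, X_test, y_train, y_test = train_test_split(
    df["full_text"], df["label"], random_state=99
)

# Splitting a single column gives back a Series, not a DataFrame.
is_series = isinstance(X_train, pd.Series)

# Indexing that Series with the column name is treated as a row-label lookup,
# and no row is labeled 'full_text'.
try:
    X_train["full_text"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(type(X_train).__name__, raised_key_error)
```

If this assumption is right, X_train is already the text column itself, so there is no 'full_text' key left to look up on it.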
Any help would be greatly appreciated. Thanks.