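I am trying to apply SMOTE or ADASYN to the ALECD data frame, which has two columns of interest: event_type and notes. event_type holds the event category, which I factorize into integers (0-9), and notes is a free-text column describing the event. The event categories are imbalanced, and I am trying to use SMOTE or ADASYN to fix the imbalance. The preprocessing is as follows: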
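import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN

with open('2008-01-01-2018-01-01.csv', 'r') as csvfile:
    ALECD = pd.read_csv(csvfile, low_memory=False)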
print(ALECD.shape)
# Clean the data: drop entries with missing notes and convert event types to integers
ALECD = ALECD[pd.notnull(ALECD['notes'])]
print(ALECD.shape)
ALECD['category_id'] = ALECD['event_type'].factorize()[0]
category_id_ALECD = ALECD[['event_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_ALECD.values)
id_to_category = dict(category_id_ALECD[['category_id', 'event_type']].values)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(ALECD.notes).toarray()
labels = ALECD.category_id
print(features.shape)
X_train, X_test, y_train, y_test = train_test_split(
    np.array(list(ALECD['notes'])).reshape(-1, 1),
    np.array(list(ALECD['event_type'])).reshape(-1, 1),
    random_state=0)
ada = ADASYN()
X_resampled, y_resampled = ada.fit_sample(X_train, y_train)
ValueError: could not convert string to float: 'Pro-Houthi Saba news reported that coalition warplanes launched two air raids on Hodeidah airport on Friday, reportedly causing heavy damage. No fatalities or injuries were reported.'
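The traceback suggests the resampler was handed the raw note strings rather than numbers. SMOTE and ADASYN build synthetic samples by interpolating between numeric feature vectors, so free text has to be vectorized first. The following is a minimal pure-Python sketch of that idea, not imbalanced-learn's implementation: it assumes a toy bag-of-words encoding in place of TF-IDF, and the notes and the helper names bag_of_words and interpolate are hypothetical.

```python
import random

def bag_of_words(texts):
    """Turn each text into a list of word counts over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = [[t.lower().split().count(w) for w in vocab] for t in texts]
    return vectors, vocab

def interpolate(a, b, rng):
    """SMOTE-style synthetic point: a + gap * (b - a), with gap in [0, 1)."""
    gap = rng.random()
    return [x + gap * (y - x) for x, y in zip(a, b)]

# Hypothetical notes standing in for the ALECD 'notes' column.
notes = ["air raid on airport", "air raid on port", "peaceful protest in city"]

vectors, vocab = bag_of_words(notes)
rng = random.Random(0)

# Interpolating between two numeric vectors of the same class is well defined.
synthetic = interpolate(vectors[0], vectors[1], rng)

# The same arithmetic on a raw string is not: the float conversion fails,
# which is the kind of ValueError shown in the traceback.
try:
    float(notes[0])
    raw_text_is_numeric = True
except ValueError:
    raw_text_is_numeric = False
```

In the code above, features and labels are already numeric, so splitting those instead of ALECD['notes'] and ALECD['event_type'] may be enough to give the resampler input it can work with.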