Error using ADASYN() or SMOTE(): could not convert string to float

Time: 2018-07-02 18:02:39

Tags: stringtokenizer

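I am trying to apply SMOTE or ADASYN to the ALECD data frame, which has two columns of interest: event_type and notes. event_type is the event category, which I factorize into integers (0-9), and notes is a free-text column describing the event. The number of events per category is imbalanced, and I am trying to fix that imbalance with SMOTE or ADASYN. The preprocessing is as follows: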

import pandas as pd

with open('2008-01-01-2018-01-01.csv', 'r') as csvfile:
    ALECD = pd.read_csv(csvfile, low_memory=False)
    print(ALECD.shape)


# Clean the data: drop entries with missing notes and convert event types to numbers
ALECD = ALECD[pd.notnull(ALECD['notes'])]
print(ALECD.shape)
ALECD['category_id'] = ALECD['event_type'].factorize()[0]
category_id_ALECD = ALECD[['event_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_ALECD.values)
id_to_category = dict(category_id_ALECD[['category_id', 'event_type']].values)
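
To make the factorization step concrete, here is a minimal sketch on a toy frame. The event type names are made up for illustration and are not taken from the real data; the point is only that `factorize()` hands out integer codes in order of first appearance, and that the two dictionaries are inverse lookups of each other.

```python
import pandas as pd

# Hypothetical event types, for illustration only (not from the real dataset).
df = pd.DataFrame(
    {"event_type": ["Battles", "Riots", "Battles", "Protests", "Riots"]}
)

# factorize() returns integer codes (in order of first appearance) and the unique labels.
codes, uniques = df["event_type"].factorize()
df["category_id"] = codes

# Build the two lookup tables the same way as the snippet above.
pairs = (
    df[["event_type", "category_id"]]
    .drop_duplicates()
    .sort_values("category_id")
)
category_to_id = dict(pairs.values)
id_to_category = dict(pairs[["category_id", "event_type"]].values)

print(df)
print(category_to_id)
print(id_to_category)
```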

Text representation

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(ALECD.notes).toarray()
labels = ALECD.category_id
print(features.shape)

Splitting the dataset into training and test sets

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(np.array(list(ALECD['notes'])).reshape(-1, 1),
                                                    np.array(list(ALECD['event_type'])).reshape(-1, 1), random_state=0)

Applying random oversampling

from imblearn.over_sampling import ADASYN

ada = ADASYN()
X_resampled, y_resampled = ada.fit_sample(X_train, y_train)

Error

ValueError: could not convert string to float: 'Pro-Houthi Saba news reported that coalition warplanes launched two air raids on Hodeidah airport on Friday, reportedly causing heavy damage. No fatalities or injuries were reported.'

0 answers:

No answers yet