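Handling imbalanced datasets in classification

Time: 2021-05-14 10:29:22

Tags: python jupyter accounting fraud-prevention

I have a large data frame of accounting-fraud data, and I want to address its class imbalance.

First, I split the data frame into two parts: X (the features) and y (the target, i.e. fraud or not fraud).

I tried: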

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import RandomUnderSampler

X = df[['fyear', 'gvkey', 'sich', 'insbnk', 'understatement', 'option',
       'p_aaer', 'new_p_aaer', 'act', 'ap', 'at', 'ceq', 'che',
       'cogs', 'csho', 'dlc', 'dltis', 'dltt', 'dp', 'ib', 'invt', 'ivao',
       'ivst', 'lct', 'lt', 'ni', 'ppegt', 'pstk', 're', 'rect', 'sale',
       'sstk', 'txp', 'txt', 'xint', 'prcc_f', 'dch_wc', 'ch_rsst', 'dch_rec',
       'dch_inv', 'soft_assets', 'ch_cs', 'ch_cm', 'ch_roa', 'issue', 'bm',
       'dpi', 'reoa', 'EBIT', 'ch_fcf']]
# 1-D Series, so that Counter(y_res) counts class labels rather than column names
y = df['target']

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))

And this:

# define sampling strategy
sample = SMOTEENN(sampling_strategy=0.5)
# fit and apply the transform
X_over, y_over = sample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_over)) 

But in both cases, the result is the same:

ValueError: could not convert string to float: '2.461.242' 

Can anyone help me, please?
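
The value '2.461.242' in the error message looks like a number stored as text with '.' as a thousands separator, which cannot be parsed as a float. Below is a minimal sketch of one way to clean such columns before resampling. It assumes that '.' always marks thousands and ',' marks decimals, which is a guess about the data and not something the error confirms. The helper name to_number and the toy frame demo are hypothetical.

```python
import pandas as pd


def to_number(column: pd.Series) -> pd.Series:
    """Convert strings such as '2.461.242' to numbers.

    Assumption: '.' is a thousands separator and ',' is a decimal comma.
    """
    if pd.api.types.is_numeric_dtype(column):
        return column  # already numeric, nothing to do
    cleaned = (
        column.astype(str)
        .str.strip()
        .str.replace(".", "", regex=False)   # drop thousands separators
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    # errors="coerce" turns anything still unparseable into NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical toy frame standing in for the real df
demo = pd.DataFrame({
    "at": ["2.461.242", "1.000", "35,5"],
    "fyear": [2001, 2002, 2003],
})
demo = demo.apply(to_number)
print(demo.dtypes)
```

On the real data this might be applied as X = X.apply(to_number) before calling fit_resample. Any NaN produced by the coercion would probably need to be imputed or dropped first. If the data frame comes from a CSV file, passing thousands='.' and decimal=',' to pd.read_csv may avoid the problem at load time, again assuming that is how the file is formatted.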

0 answers:

No answers