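I am using SMOTE to resample a binary class, TARGET_FRAUD, which contains the values 0 and 1. Class 0 has about 900 records, while class 1 has only about 100. I want to oversample class 1 to around 800 records.
This is in preparation for some classification modeling.
Here are the value counts before resampling: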
# Fix imbalanced data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
# Bar plot of the TARGET_FRAUD distribution
sns.countplot(x='TARGET_FRAUD', data=df)
plt.title('Before Resampling')
plt.show()
# Synthetic Minority Over-sampling Technique
sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(df.drop('TARGET_FRAUD', axis=1), df['TARGET_FRAUD'])
resampled_df = pd.concat([pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)], axis=1)
resampled_df.columns = df.columns
sns.countplot(x='TARGET_FRAUD', data=resampled_df)
plt.title('After Resampling')
plt.show()
Here are the value counts after resampling:
TARGET_FRAUD:
0 898
1 102
Why does this produce random float values between 0 and 1? I only want it to return the int values 0 and 1.
Answer 0 (score: 1)
I don't have your dataset, but I built a reproducible example based on your code. I cannot reproduce what you describe.
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
X, y = make_classification(random_state=0, weights=[0.9, 0.1])
df = pd.DataFrame(X)
df["TARGET_FRAUD"] = y
print("Before resampling")
print(Counter(df["TARGET_FRAUD"]))
sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(
    df.drop("TARGET_FRAUD", axis=1), df["TARGET_FRAUD"]
)
resampled_df = pd.concat(
    [pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)],
    axis=1,
)
print("After resampling")
print(Counter(resampled_df["TARGET_FRAUD"]))
This prints:
Before resampling
Counter({0: 90, 1: 10})
After resampling
Counter({0: 90, 1: 90})
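One possible cause of the floats is worth checking, although it is only a guess since I cannot see your data. Your code puts the target first in pd.concat and then assigns resampled_df.columns = df.columns. If TARGET_FRAUD is not the first column of df, that assignment relabels a feature column as TARGET_FRAUD. The sketch below shows the effect with made-up toy values and without SMOTE; the names feature_a, feature_b, mislabeled and aligned are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two float features, with the target as the LAST column.
df = pd.DataFrame(
    {
        "feature_a": [0.12, 0.87, 0.45, 0.33],
        "feature_b": [0.91, 0.05, 0.64, 0.58],
        "TARGET_FRAUD": [0, 1, 0, 1],
    }
)

# Stand-ins for the two objects returned by fit_resample (assumption: no SMOTE here).
features = df.drop("TARGET_FRAUD", axis=1)
target = df["TARGET_FRAUD"]

# Same pattern as in the question: target first, then the original column names.
mislabeled = pd.concat([pd.DataFrame(target), pd.DataFrame(features)], axis=1)
mislabeled.columns = df.columns
# The column now called TARGET_FRAUD actually holds feature_b's floats.
print(mislabeled["TARGET_FRAUD"].tolist())

# Keeping the features first preserves the original order, so no renaming is needed.
aligned = pd.concat([features, target], axis=1)
print(aligned["TARGET_FRAUD"].tolist())
```

If this matches your situation, dropping the resampled_df.columns = df.columns line, or concatenating the features before the target, should leave TARGET_FRAUD with only 0 and 1.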