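I have a time series at hourly frequency, with one label per day. I want to fix the class imbalance by oversampling while preserving the chronological order within each day. Ideally I would be able to use ADASYN or another method that does better than random oversampling. The data looks like this: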
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')
data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = [df.loc[i,'DateTime'].floor('D') for i in range(len(df))]
class_labels = []
for i in df['Date'].unique():
    class_labels.append([i, np.random.choice((1,2,3,4,5,6,7,8,9,10), size=1,
        p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)
df['class_label'] = [class_labels[class_labels.loc[:,0] == df.loc[i,'Date']].loc[:,1].values[0] for i in range(len(df))]
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)
print(df['class_label'].value_counts())
df.head(15)
Out[209]:
5 264
1 240
6 145
9 120
7 120
10 72
8 72
4 24
2 24
Out[213]:
col1 col_2 class_label
DateTime
2019-02-01 18:28:29.214935 0.095549 0.307041 6
2019-02-01 19:28:29.214935 0.925004 0.981620 6
2019-02-01 20:28:29.214935 0.343573 0.610662 6
2019-02-01 21:28:29.214935 0.310477 0.482961 6
2019-02-01 22:28:29.214935 0.002010 0.242208 6
2019-02-01 23:28:29.214935 0.235595 0.355516 6
2019-02-02 00:28:29.214935 0.237792 0.028726 5
2019-02-02 01:28:29.214935 0.735916 0.221198 5
2019-02-02 02:28:29.214935 0.495468 0.712723 5
2019-02-02 03:28:29.214935 0.784425 0.818065 5
2019-02-02 04:28:29.214935 0.126506 0.414326 5
2019-02-02 05:28:29.214935 0.606649 0.264835 5
2019-02-02 06:28:29.214935 0.466121 0.244843 5
2019-02-02 07:28:29.214935 0.237132 0.298100 5
2019-02-02 08:28:29.214935 0.435159 0.621991 5
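I would like to use ADASYN or SMOTE, but even random oversampling would be an acceptable way to fix the class imbalance.
The ideal result would, like the original, increase in hourly steps, have one label per day, and have balanced classes: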
print(df['class_label'].value_counts())
Out[211]:
5 264
1 264
6 264
9 264
7 264
10 264
8 264
4 264
2 264
Answer 0 (score: 0)
Use groupby first, then sample each subset in a loop:
newdf=pd.concat([y.sample(264,replace=True) for _, y in df.groupby('class_label')])
newdf.class_label.value_counts()
9 264
7 264
5 264
1 264
10 264
8 264
6 264
4 264
2 264
Name: class_label, dtype: int64
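The `groupby` + `sample` line above balances the row counts, but it draws individual hourly rows, so the rows of a given day end up scattered and out of order. If whole days need to stay intact, one option is to oversample at the day level instead. The following is a minimal sketch rather than a tested recipe for the data in the question: `oversample_days` is a hypothetical helper, and it assumes a `DatetimeIndex`, one label per calendar day, and complete days of equal length (the partial first and last days in the question's data would have to be dropped first). It copies randomly chosen days of the smaller classes until every class has as many days as the largest one, then assigns a fresh hourly index so the timestamps keep increasing by one hour. This is still plain random oversampling, not SMOTE or ADASYN.

```python
import numpy as np
import pandas as pd


def oversample_days(df, label_col="class_label", seed=0):
    """Randomly oversample whole days (hypothetical helper, see assumptions above)."""
    rng = np.random.default_rng(seed)

    # Split the frame into one block per calendar day, in chronological order.
    blocks = [block for _, block in df.groupby(df.index.floor("D"))]
    # Assumption: each day carries exactly one label, so the first row is enough.
    labels = np.array([block[label_col].iloc[0] for block in blocks])

    # Target: every class gets as many days as the most frequent class.
    counts = pd.Series(labels).value_counts()
    target = counts.max()

    chosen = []
    for label in counts.index:
        ids = np.flatnonzero(labels == label)
        # Keep every original day and add random copies to reach the target.
        extra = rng.choice(ids, size=target - len(ids), replace=True)
        chosen.extend(ids.tolist() + extra.tolist())

    # Shuffle the order of the days; rows inside each day keep their order.
    order = rng.permutation(chosen)
    out = pd.concat([blocks[i] for i in order])

    # Assumption: complete, equal-length days, so a fresh hourly index
    # lines each copied block up with exactly one calendar day.
    new_index = pd.date_range(df.index[0], periods=len(out),
                              freq=pd.Timedelta(hours=1))
    out.index = new_index.rename(df.index.name)
    return out


# Small demo: six complete days starting at midnight, labelled 1,1,1,2,2,3.
idx = pd.date_range("2019-02-01", periods=6 * 24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame(
    {"col1": np.arange(len(idx), dtype=float),
     "class_label": np.repeat([1, 1, 1, 2, 2, 3], 24)},
    index=idx,
)
balanced = oversample_days(demo)
print(balanced["class_label"].value_counts())
```

Note that the days themselves come out in shuffled order; only the order of the hours within each day is preserved. Classes are balanced by number of days, which matches balanced row counts only when all days have the same length.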