How to oversample time series data to fix class imbalance?

Time: 2019-02-02 01:35:12

Tags: python pandas classification oversampling

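I have a time series with hourly frequency and one label per day. I want to fix the class imbalance by oversampling while preserving the hourly ordering within each day. Ideally I would be able to use ADASYN or some other method that is better than random oversampling. The data looks like this: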

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')

data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = df['DateTime'].dt.floor('D')  # calendar day of each hourly row

class_labels = []
for i in df['Date'].unique():
    class_labels.append([i,np.random.choice((1,2,3,4,5,6,7,8,9,10),size=1,
                                           p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)

# Look up each row's label by its calendar day (column 0 = day, column 1 = label)
df['class_label'] = df['Date'].map(class_labels.set_index(0)[1])
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)

print(df['class_label'].value_counts())
df.head(15)

Out[209]: 
5     264
1     240
6     145
9     120
7     120
10     72
8      72
4      24
2      24

Out[213]: 
                                col1     col_2  class_label
DateTime                                                   
2019-02-01 18:28:29.214935  0.095549  0.307041            6
2019-02-01 19:28:29.214935  0.925004  0.981620            6
2019-02-01 20:28:29.214935  0.343573  0.610662            6
2019-02-01 21:28:29.214935  0.310477  0.482961            6
2019-02-01 22:28:29.214935  0.002010  0.242208            6
2019-02-01 23:28:29.214935  0.235595  0.355516            6
2019-02-02 00:28:29.214935  0.237792  0.028726            5
2019-02-02 01:28:29.214935  0.735916  0.221198            5
2019-02-02 02:28:29.214935  0.495468  0.712723            5
2019-02-02 03:28:29.214935  0.784425  0.818065            5
2019-02-02 04:28:29.214935  0.126506  0.414326            5
2019-02-02 05:28:29.214935  0.606649  0.264835            5
2019-02-02 06:28:29.214935  0.466121  0.244843            5
2019-02-02 07:28:29.214935  0.237132  0.298100            5
2019-02-02 08:28:29.214935  0.435159  0.621991            5

I would like to use ADASYN or SMOTE, but even random oversampling that fixes the class imbalance would be fine.

The ideal result would look like the original: hourly increments, one label per day, and balanced classes:

print(df['class_label'].value_counts())

Out[211]: 
5     264
1     264
6     264
9     264
7     264
10    264
8     264
4     264
2     264

1 Answer:

Answer 0 (score: 0)

Use groupby first, then loop over the groups and sample each subset with replacement:

newdf=pd.concat([y.sample(264,replace=True) for _, y in df.groupby('class_label')])
newdf.class_label.value_counts()
9     264
7     264
5     264
1     264
10    264
8     264
6     264
4     264
2     264
Name: class_label, dtype: int64
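
Note that the one-liner above samples individual hourly rows, so the rows of a day get shuffled and split apart. A minimal sketch of a day-level variant follows, assuming the label is constant within each calendar day and that balancing the number of days per class (not the exact row count) is acceptable. The function name `oversample_days` and the `block_id` column are hypothetical names introduced here for illustration; this is plain random oversampling, not ADASYN or SMOTE.

```python
import numpy as np
import pandas as pd

def oversample_days(df, label_col="class_label", seed=0):
    """Randomly oversample whole calendar days until every class has as many
    days as the most frequent class. Rows inside a day keep their order."""
    rng = np.random.default_rng(seed)
    day_key = df.index.floor("D")                        # calendar day of each row
    day_labels = df[label_col].groupby(day_key).first()  # one label per day
    target = day_labels.value_counts().max()             # days in the largest class
    rows_by_day = {d: g for d, g in df.groupby(day_key)}

    blocks = []
    for label, days in day_labels.groupby(day_labels):
        originals = list(days.index)
        # Draw the missing days with replacement from this class's own days.
        picks = rng.integers(0, len(originals), size=target - len(originals))
        extra = [originals[i] for i in picks]
        for d in originals + extra:
            # block_id (hypothetical column) tells repeated copies of a day apart.
            blocks.append(rows_by_day[d].assign(block_id=len(blocks)))
    return pd.concat(blocks)

# Toy data: 6 full days of hourly rows with day labels 1, 1, 1, 2, 2, 3.
idx = pd.date_range("2019-02-01", periods=24 * 6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"col1": np.arange(len(idx), dtype=float),
                    "class_label": np.repeat([1, 1, 1, 2, 2, 3], 24)}, index=idx)
balanced = oversample_days(toy)
print(balanced["class_label"].value_counts())
```

Repeated days keep their original timestamps, so the index of the result contains duplicates; group by `block_id` to recover each day as an ordered sequence. With partial days in the data (such as the first and last day in the question), the row counts per class can still differ slightly even though the day counts match.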