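Question: In Python, how do I do a train/test split with forward chaining on a time-series dataset that (1) is really a combination of several time series (for example, the financial performance of multiple companies) and (2) has imbalanced classes?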
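Suppose you have a dataset like the financial distress dataset. Each row corresponds to a company's financial state at a specific point in time. A company can have several rows, so the dataset lends itself to time-series prediction. The target can be set up as a binary variable: 1 (financially distressed) or 0 (healthy).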
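I want to do forward chaining, but TimeSeriesSplit assumes that your dataset can be treated as a single entity. That is not the case here, because the rows belonging to one company arguably form a time series of their own.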
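To make the issue concrete, here is a minimal sketch on made-up data. The toy frame, its `Time` column, and the choice of `n_splits=3` are assumptions for illustration only; they are not taken from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy panel: company A's 4 periods stacked on top of company B's 4 periods.
df = pd.DataFrame({
    "Company": ["A"] * 4 + ["B"] * 4,
    "Time": [1, 2, 3, 4] * 2,
})

# TimeSeriesSplit only looks at row position; it knows nothing about "Company".
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(df))

for train_pos, test_pos in splits:
    print("train:", df.loc[train_pos, "Company"].tolist(),
          "test:", df.loc[test_pos, "Company"].tolist())
```

In this toy case the second split trains only on company A's rows and tests on company B's rows, so the "past" and "future" come from different series.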
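Someone managed to implement a solution in Python to get around this. The code performs the train-test split on each company's time series separately and "pads" the shorter series so that every company ends up with the same number of folds.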
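But there is a problem: this does not work well with imbalanced classes. I ran the code below on a similar dataset and tried to compute the ROC AUC score, but I keep getting "ValueError: Only one class present in y_true. ROC AUC score is not defined in that case." It looks like some "folds" contain few or no minority-class cases.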
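As a stopgap, the metric call can be guarded so that single-class folds are skipped instead of raising. This is only a sketch: `safe_roc_auc` is a hypothetical helper name, and it hides the symptom without making the folds any more balanced.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or NaN when y_true contains fewer than two classes."""
    y_true = np.asarray(y_true)
    if np.unique(y_true).size < 2:
        # ROC AUC is undefined for a single class; signal that with NaN.
        return float("nan")
    return roc_auc_score(y_true, y_score)


# Toy examples (made-up labels and scores)
single_class = safe_roc_auc([0, 0, 0], [0.1, 0.2, 0.3])
both_classes = safe_roc_auc([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])
```

Fold scores collected this way can then be averaged with `np.nanmean`, keeping in mind that the average is based on fewer folds.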
Is there any way to split a dataset like this so that every class is adequately represented in each split?
import pandas as pd


# Create Time-Series sampling function to draw train-test splits
def ts_sample(df_input, train_rows, test_rows):
    """
    Function to draw specified train_rows and test_rows in time-series
    forward-chaining format (the training window always starts at row 0 and grows)
    :param df_input: Input DataFrame
    :param train_rows: Number of rows to use as the initial training set
    :param test_rows: Number of rows to use as test set
    :return: List of tuples. Each tuple contains 2 indexes corresponding to train and test index
    """
    # Too few rows for even one test fold: everything goes into train
    if df_input.shape[0] <= train_rows:
        return [(df_input.index, pd.Index([]))]
    i = 0
    train_lower, train_upper = 0, train_rows + test_rows*i
    test_lower, test_upper = train_upper, min(train_upper + test_rows, df_input.shape[0])
    result_list = []
    while train_upper < df_input.shape[0]:
        # Get indexes into result_list
        # result_list += [([df_input.index[train_lower], df_input.index[train_upper]],
        #                  [df_input.index[test_lower], df_input.index[test_upper]])]
        result_list += [(df_input.index[train_lower:train_upper],
                         df_input.index[test_lower:test_upper])]
        # Update counter and calculate new indexes
        i += 1
        train_upper = train_rows + test_rows*i
        test_lower, test_upper = train_upper, min(train_upper + test_rows, df_input.shape[0])
    return result_list
# Depending on the size of the group, the length of the ts_sample output (a list of
# (train_index, test_index) tuples) will vary. However, we want all of these lists to have equal length.
# To do that, we pad the shorter lists by appending (last train_index + last test_index, empty index).
# For example (train_rows=3, test_rows=2):
# group 1 => [(Int64Index([1, 2, 3], dtype='int64'), Int64Index([4, 5], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64'))]
# group 2 => [(Int64Index([10, 11, 12], dtype='int64'), Int64Index([13, 14], dtype='int64')),
#             (Int64Index([10, 11, 12, 13, 14], dtype='int64'), Int64Index([15, 16], dtype='int64')),
#             (Int64Index([10, 11, 12, 13, 14, 15, 16], dtype='int64'), Int64Index([17, 18], dtype='int64'))]
# Above, group 2 has 3 folds whereas group 1 has 2. We will pad group 1 to also have 3 folds:
# group 1 => [(Int64Index([1, 2, 3], dtype='int64'), Int64Index([4, 5], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5, 6], dtype='int64'), Index([], dtype='object'))]
grouped_company_cross = df_cross.groupby('Company')
acc = []
max_size = 0
for name, group in grouped_company_cross:
    # For each group, calculate ts_sample and also store largest ts_sample output size
    group_res = ts_sample(group, 4, 4)
    acc += [group_res]
    # print('Working on name:' + str(name))
    # print(acc)
    if len(group_res) > max_size:
        # Update the max_size that we have observed so far
        max_size = len(group_res)
        # All existing lists (apart from the one added latest) in acc need to be augmented
        # to match the new max_size by appending the last value in those lists (combining train and test)
        for idx, list_i in enumerate(acc):
            if len(list_i) < max_size:
                last_train, last_test = list_i[-1][0], list_i[-1][1]
                list_i[len(list_i):max_size] = [(last_train.union(last_test),
                                                 pd.Index([]))] * (max_size - len(list_i))
                acc[idx] = list_i
    elif len(group_res) < max_size:
        # Only the last appended list (group_res) needs to be augmented
        last_train, last_test = acc[-1][-1][0], acc[-1][-1][1]
        acc[-1] = acc[-1] + [(last_train.union(last_test), pd.Index([]))] * (max_size - len(acc[-1]))