Python implementation of train/test splitting and forward chaining for imbalanced time-series classification

Time: 2020-06-04 17:16:36

Tags: python time-series kaggle

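Question: How can I do a train/test split and forward chaining in Python on a time-series dataset that (1) is essentially a combination of multiple time series (e.g., the financial performance of several companies) and (2) has imbalanced classes?

Suppose you have a dataset like the financial distress dataset. Each row corresponds to a company's financial state at a specific point in time. A company can have multiple rows, so the dataset is suitable for time-series prediction. The target can be set up as a binary variable: 1 (financially distressed) or 0 (healthy).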

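I want to do forward chaining, but TimeSeriesSplit assumes that the whole dataset can be treated as a single time series. That is not the case here, because the rows belonging to each company form a time series of their own.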
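To illustrate the point, here is a small sketch on made-up data (the frame `df_toy` and its `Company` and `Time` columns are hypothetical, not the real dataset). It applies scikit-learn's `TimeSeriesSplit` to rows stacked company by company and lists which companies end up in each fold:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy panel: 3 companies, 4 time periods each,
# stacked company by company (A's rows, then B's, then C's).
df_toy = pd.DataFrame({
    "Company": np.repeat(["A", "B", "C"], 4),
    "Time": np.tile([1, 2, 3, 4], 3),
})

# TimeSeriesSplit only sees row positions; it knows nothing about companies.
tscv = TimeSeriesSplit(n_splits=2)
splits = list(tscv.split(df_toy))

for fold, (train_pos, test_pos) in enumerate(splits):
    train_companies = sorted(df_toy.iloc[train_pos]["Company"].unique())
    test_companies = sorted(df_toy.iloc[test_pos]["Company"].unique())
    print(f"fold {fold}: train on {train_companies}, test on {test_companies}")
```

With this layout the folds are cut along row position, so the split separates whole companies from one another rather than earlier periods from later ones within each company.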

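Someone managed to implement a solution in Python to get around this. The code performs the train/test split on each company's time series separately and "pads" the shorter series so that every company ends up with the same number of folds.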

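There is a problem, though: this does not work well when the classes are imbalanced. I ran the following code on a similar dataset and tried to compute the AUC score, but I keep getting "ValueError: Only one class present in y_true. ROC AUC score is not defined in that case." It seems that some folds contain few or no minority-class cases.

Is there any way to split a dataset like this so that every class is adequately represented in each split?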
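The error is raised when the labels passed as `y_true` contain only one class. As a stopgap while diagnosing the folds, a guard like the sketch below can report such folds as NaN instead of failing. `safe_roc_auc` is a hypothetical helper, not part of scikit-learn, and it only hides the symptom; it does not make the splits any more balanced.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or NaN when y_true holds fewer than two classes.

    Hypothetical helper: it checks the labels first and never calls
    roc_auc_score on a single-class fold.
    """
    y_true = np.asarray(y_true)
    if np.unique(y_true).size < 2:
        return float("nan")
    return roc_auc_score(y_true, y_score)


# A fold whose test set contains only the majority class
one_class = safe_roc_auc([0, 0, 0], [0.1, 0.4, 0.3])

# A fold with both classes, perfectly ranked by the scores
two_class = safe_roc_auc([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])

print(one_class, two_class)
```

Averaging the per-fold scores with `np.nanmean` would then skip the undefined folds, at the cost of evaluating on fewer folds.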

import pandas as pd

# Create Time-Series sampling function to draw train-test splits
def ts_sample(df_input, train_rows, test_rows):
    """
    Function to draw specified train_rows and test_rows in time-series rolling sampling format
    :param df_input: Input DataFrame
    :param train_rows: Number of rows to use as training set
    :param test_rows: Number of rows to use as test set
    :return: List of tuples. Each tuple contains 2 lists of indexes corresponding to train and test index
    """
    if df_input.shape[0] <= train_rows:
        return [(df_input.index, pd.Index([]))]

    i = 0
    train_lower, train_upper = 0, train_rows + test_rows*i
    test_lower, test_upper = train_upper, min(train_upper + test_rows, df_input.shape[0])

    result_list = []
    while train_upper < df_input.shape[0]:
        # Get indexes into result_list
        # result_list += [([df_input.index[train_lower], df_input.index[train_upper]],
        #                  [df_input.index[test_lower], df_input.index[test_upper]])]
        result_list += [(df_input.index[train_lower:train_upper],
                         df_input.index[test_lower:test_upper])]

        # Update counter and calculate new indexes
        i += 1
        train_upper = train_rows + test_rows*i
        test_lower, test_upper = train_upper, min(train_upper + test_rows, df_input.shape[0])

    return result_list

# Depending on the size of the group, the output size of ts_sample (a list of
# (train_index, test_index) tuples) will vary. However, we want all of these lists to have equal length.
# To do that, we augment the shorter lists by appending the last seen train_index and test_index, combined.
# For example (here with train_rows=3, test_rows=2):
# group 1 => [(Int64Index([1, 2, 3], dtype='int64'), Int64Index([4, 5], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64'))]
# group 2 => [(Int64Index([10, 11, 12], dtype='int64'), Int64Index([13, 14], dtype='int64')),
#             (Int64Index([10, 11, 12, 13, 14]), Int64Index([15, 16])),
#             (Int64Index([10, 11, 12, 13, 14, 15, 16]), Int64Index([17, 18]))]
# Above, group 2 has 3 folds whereas group 1 has 2. We will augment group 1 to also have 3 folds:
# group 1 => [(Int64Index([1, 2, 3], dtype='int64'), Int64Index([4, 5], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
#             (Int64Index([1, 2, 3, 4, 5, 6]), Int64Index([]))]
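# df_cross is the full dataset (one row per company per time point); its rows are
# assumed to be in time order within each company.
grouped_company_cross = df_cross.groupby('Company')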
acc = []
max_size = 0
for name, group in grouped_company_cross:
    # For each group, calculate ts_sample and also store largest ts_sample output size
    group_res = ts_sample(group, 4, 4)
    acc += [group_res]
    # print('Working on name:' + str(name))
    # print(acc)

    if len(group_res) > max_size:
        # Update the max_size that we have observed so far
        max_size = len(group_res)

        # All existing lists in acc (apart from the one added last) need to be augmented
        # to match the new max_size by appending their last fold (train and test combined)
        for idx, list_i in enumerate(acc):
            if len(list_i) < max_size:
                last_train, last_test = list_i[-1][0], list_i[-1][1]
                list_i[len(list_i):max_size] = [(last_train.union(last_test),
                                                 pd.Index([]))] * (max_size - len(list_i))

                acc[idx] = list_i

    elif len(group_res) < max_size:
        # Only the last appended list (group_res) needs to be augmented
        last_train, last_test = acc[-1][-1][0], acc[-1][-1][1]
        acc[-1] = acc[-1] + [(last_train.union(last_test), pd.Index([]))] * (max_size - len(acc[-1]))
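
To see which folds lack the minority class, the per-company folds in `acc` can be merged fold by fold and the test labels counted. The sketch below is only a diagnostic under assumptions: `combine_folds` and `class_counts_per_fold` are hypothetical helpers, the small `acc` and the label Series `y` are made-up stand-ins for the real data, and `y` is assumed to share the DataFrame's index.

```python
import pandas as pd


def combine_folds(acc):
    """Merge fold k of every group into one (train_labels, test_labels) pair.

    acc is a list (one entry per group) of lists of (train_index, test_index)
    tuples, as built by the loop above.
    """
    n_folds = max(len(group_folds) for group_folds in acc)
    combined = []
    for k in range(n_folds):
        train_labels, test_labels = [], []
        for group_folds in acc:
            if k < len(group_folds):  # unpadded groups are skipped for this fold
                train_idx, test_idx = group_folds[k]
                train_labels.extend(train_idx.tolist())
                test_labels.extend(test_idx.tolist())
        combined.append((train_labels, test_labels))
    return combined


def class_counts_per_fold(y, combined, classes=(0, 1)):
    """Count how many test rows of each class every combined fold contains."""
    rows = []
    for _, test_labels in combined:
        counts = y.loc[test_labels].value_counts()
        rows.append(counts.reindex(list(classes), fill_value=0).to_dict())
    return pd.DataFrame(rows)


# Made-up example: group A owns rows 0-3 (already padded), group B owns rows 4-9
y = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], index=range(10))
acc = [
    [(pd.Index([0, 1]), pd.Index([2, 3])),
     (pd.Index([0, 1, 2, 3]), pd.Index([]))],
    [(pd.Index([4, 5]), pd.Index([6, 7])),
     (pd.Index([4, 5, 6, 7]), pd.Index([8, 9]))],
]

combined = combine_folds(acc)
counts = class_counts_per_fold(y, combined)
print(counts)  # one row per fold, one column per class
```

Any fold whose minority-class column is zero is one where ROC AUC cannot be computed on the test set.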

0 answers:

No answers yet