Training and test sets have different numbers of unique target labels

Asked: 2019-11-07 18:59:04

Tags: r classification multilabel-classification

I am trying out different algorithms, including neural networks, on a multi-class sequential classification problem.

The dataset consists of 66 Bach chorales, with 12 binary note columns and a target variable "chord" that has 102 unique labels.

> dim(d)
[1] 5664   17

> head(d)
      c_id event   c c_sh   d d_sh   e   f f_sh   g g_sh   a a_sh   b bass meter chord
1 000106b_     2 YES   NO  NO   NO YES  NO   NO YES   NO  NO   NO  NO    E     5   C_M
2 000106b_     3 YES   NO  NO   NO YES  NO   NO YES   NO  NO   NO  NO    E     2   C_M
3 000106b_     4 YES   NO  NO   NO  NO YES   NO  NO   NO YES   NO  NO    F     3   F_M
4 000106b_     5 YES   NO  NO   NO  NO YES   NO  NO   NO YES   NO  NO    F     2   F_M
5 000106b_     6  NO   NO YES   NO  NO YES   NO  NO   NO YES   NO  NO    D     4   D_m
6 000106b_     7  NO   NO YES   NO  NO YES   NO  NO   NO YES   NO  NO    D     2   D_m

> levels(e$chord)
  [1] " A#d"  " A#d7" " A_d"  " A_m"  " A_M"  " A_m4" " A_M4" " A_m6" " A_M6" " A_m7" " A_M7" " Abd"  " Abm" 
 [14] " AbM"  " B_d"  " B_d7" " B_m"  " B_M"  " B_M4" " B_m6" " B_m7" " B_M7" " Bbd"  " Bbm"  " BbM"  " Bbm6"
 [27] " BbM7" " C#d"  " C#d6" " C#d7" " C#m"  " C#M"  " C#M4" " C#m7" " C#M7" " C_d6" " C_d7" " C_m"  " C_M" 
 [40] " C_M4" " C_m6" " C_M6" " C_m7" " C_M7" " D#d"  " D#d6" " D#d7" " D#m"  " D#M"  " D_d7" " D_m"  " D_M" 
 [53] " D_M4" " D_m6" " D_M6" " D_m7" " D_M7" " Dbd"  " Dbd7" " Dbm"  " DbM"  " Dbm7" " DbM7" " E_d"  " E_m" 
 [66] " E_M"  " E_M4" " E_m6" " E_m7" " E_M7" " Ebd"  " EbM"  " EbM7" " F#d"  " F#d7" " F#m"  " F#M"  " F#M4"
 [79] " F#m6" " F#m7" " F#M7" " F_d"  " F_d7" " F_m"  " F_M"  " F_M4" " F_m6" " F_M6" " F_m7" " F_M7" " G#d" 
 [92] " G#d7" " G#m"  " G#M"  " G_d"  " G_m"  " G_M"  " G_M4" " G_m6" " G_M6" " G_m7" " G_M7"

> length(unique(e$chord))
[1] 102

> nrows_split_d # number of observations for each class label
  [1]   5   4   5 258 352   2  16  10   2  11  56   1   2  37  17   8 217 143   3   2  19  46   5  26 312   6
 [27]   3  10   2  15  24  39   2   9   7   2   2 144 488  16  17   6  20  66   7   1   4   2   2   4 165 503
 [53]  16  12   3  33  58   2   2   4  21   3   1   6 241 295  14  14  24  43   1 146   1  14   1 143  90  12
 [79]   7  19  34   3   1  42 388  14   3   4   7  38  11   6   6   1   3 179 489   8   3   3  18  52
> 

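Because some class labels have very few observations, I run into problems. For randomly sampling/splitting the data into training and test sets, I found a workaround that removes some of the randomness: find a seed for which the training set contains at least one observation of each of the 102 class labels. With 5664 observations and a training set holding 70-80% of the data, this is easy to achieve. It also seems to work fine whenever the algorithm I use does not require matrix input/output.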
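The seed search described above could look roughly like the following. This is only a sketch on made-up data: the `toy` data frame, the helper name `find_seed`, and the parameters `train_frac` and `max_seed` are assumptions for illustration, not the code actually used in the question.

```r
# Toy stand-in for the real data: a few common and a few rare classes (hypothetical).
set.seed(1)
toy <- data.frame(
  x     = rnorm(200),
  chord = factor(sample(c(rep("C_M", 120), rep("F_M", 60), rep("D_m", 15),
                          rep("G_M4", 3), rep("A_m6", 2))))
)

# Hypothetical helper: try seeds in turn until the training split
# contains every class label at least once.
find_seed <- function(data, target, train_frac = 0.75, max_seed = 1000) {
  n_labels <- length(unique(data[[target]]))
  for (seed in seq_len(max_seed)) {
    set.seed(seed)
    idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
    if (length(unique(data[[target]][idx])) == n_labels) {
      return(list(seed = seed, train_idx = idx))
    }
  }
  stop("no suitable seed found within max_seed tries")
}

res   <- find_seed(toy, "chord")
train <- toy[res$train_idx, ]
test  <- toy[-res$train_idx, ]
```

Note that this only constrains the training set; nothing here guarantees that the test set sees every label.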

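When I feed the data into a NN, the same strategy seems to work, but I run into problems when building the confusion matrix: because some unique labels are missing from the test set, the matrices have different dimensions and indices.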
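One possible way to keep the confusion matrix square is to convert both vectors to factors that share one full set of levels before calling `table()`. The sketch below uses invented toy vectors; `all_levels` is a hypothetical stand-in for the 102 chord labels, and the names `pre` and `test_t` are reused from the question only for readability.

```r
# All labels that may occur (hypothetical toy set; the real data has 102).
all_levels <- c("A_m", "C_M", "D_m", "F_M", "G_M")

# Toy predictions and true labels, each missing some of the labels.
pre    <- c("C_M", "C_M", "F_M", "D_m", "C_M", "F_M")
test_t <- c("C_M", "F_M", "F_M", "D_m", "G_M", "A_m")

# Without shared levels the table is 3 x 5, so rows and columns do not line up.
dim(table(predict = pre, actual = test_t))

# With shared levels the table is square; labels that never occur
# simply get an all-zero row or column.
cm <- table(predict = factor(pre,    levels = all_levels),
            actual  = factor(test_t, levels = all_levels))
dim(cm)
```

With this layout, row i and column i always refer to the same label, whichever labels happen to be present in a given split.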

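Even when I reduce the size of the training set and sample with replacement, I cannot find a seed for which both the training set and the test set contain every one of the 102 class labels.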

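One potential solution I might try is to introduce duplicate observations for class labels with low frequency. But I am hesitant, because that is basically cheating.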

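1) Is it "allowed" / "good practice" to manipulate the seed in order to ensure that the training set contains at least one observation of each unique class label?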

2) Is there a solution that does not involve manipulating the seed or introducing duplicates?
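As one conceivable alternative to searching for a seed, the split could be drawn per class, so that every label lands in the training set by construction. The following is a minimal base-R sketch under that assumption; `stratified_train_idx` and `train_frac` are hypothetical names, and the labels are toy data.

```r
# Hypothetical helper: take train_frac of each class, but always at least one row.
stratified_train_idx <- function(labels, train_frac = 0.75) {
  by_class <- split(seq_along(labels), labels, drop = TRUE)
  picked <- lapply(by_class, function(rows) {
    n_train <- max(1, floor(train_frac * length(rows)))
    rows[sample.int(length(rows), n_train)]  # safe even when a class has one row
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
labels    <- factor(c(rep("C_M", 40), rep("F_M", 20), rep("D_m", 3), "G_M4"))
train_idx <- stratified_train_idx(labels)
train_lab <- labels[train_idx]
test_lab  <- labels[-train_idx]
```

A class with a single observation (several appear in the counts above) can only ever end up in one of the two sets, so with this approach the test set will still be missing those labels.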

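If I cannot ensure that all 102 unique labels are present in the training set, I cannot pass the data to most algorithms, because the mismatch causes errors.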

> length(unique(test$chord))
[1] 75
> length(unique(train$chord))
[1] 102

t <- table(predict=pre, actual=test_t)

> length(unique(pre))
[1] 62
> length(unique(test_t))
[1] 85

nn_accuracy <- sum(diag(t) / sum(t))
> nn_accuracy
[1] 0.4077125
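Since the table here has 62 distinct predicted labels against 85 distinct actual labels, its diagonal may not pair each label with itself, so the figure above may not reflect the true accuracy. A small sketch on invented toy vectors (not the question's data) illustrates the difference between the diagonal of a non-square table and a direct element-wise comparison:

```r
# Toy predictions and true labels (hypothetical).
pre    <- c("C_M", "C_M", "F_M", "D_m", "C_M", "F_M")
test_t <- c("C_M", "F_M", "F_M", "D_m", "G_M", "A_m")

# Element-wise comparison needs no confusion matrix at all.
acc_direct <- mean(pre == test_t)

# diag() on a non-square table pairs row i with column i,
# regardless of which labels those positions stand for.
t_raw    <- table(predict = pre, actual = test_t)
acc_diag <- sum(diag(t_raw)) / sum(t_raw)

c(direct = acc_direct, diagonal = acc_diag)
```

If the two vectors are factors with different level sets, converting them with `as.character()` before the comparison avoids an error from `==`.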

0 answers:

No answers