Has anyone successfully (intentionally) overfit a neural network on MNIST?

Asked: 2019-04-22 01:39:24

Tags: autoencoder relu

I am currently studying the topic of "representational capacity (expressive power) of neural networks" and am trying to intentionally, completely overfit a neural network, meaning that at minimum the model should have the capacity to build a perfect mapping between the training inputs and outputs.

The data I am currently using for this experiment is MNIST, and I am using an AutoEncoder/Decoder structure to check whether I can intentionally overfit a neural network with this architecture.

What I am mainly interested in is which combination of latent dimensionality and number of ReLUs best expands the expressive power of the network, i.e. which combination minimizes the training loss (in this case I use the binary cross entropy between x and recon_x).

The problem is that I have not managed to overfit (i.e. to drive the loss almost to 0).

I have tried several deep/shallow FCNs with different latent dimensions; my best minimum loss so far is 55, which looks far too large compared to 0.

import torch
import torch.nn as nn


class AE(nn.Module):

    def __init__(self,
                 encoder_layer_sizes,
                 latent_size,
                 decoder_layer_sizes,
                 num_labels=0):  # note: num_labels is not used in this model

        super().__init__()

        assert isinstance(encoder_layer_sizes, list)
        assert isinstance(latent_size, int)
        assert isinstance(decoder_layer_sizes, list)

        self.latent_size = latent_size

        self.encoder = Encoder(
            encoder_layer_sizes,
            latent_size,
            num_labels)
        self.decoder = Decoder(
            decoder_layer_sizes,
            latent_size,
            num_labels)

    def forward(self,
                x,
                c=None):

        if x.dim() > 2:
            x = x.view(-1, 28*28)

        z = self.encoder(x, c)

        recon_x = self.decoder(z, c)

        return recon_x, z

    def inference(self, device, n=1, c=None):

        batch_size = n
        z = torch.randn([batch_size,
                         self.latent_size]).to(device)

        recon_x = self.decoder(z, c)

        return recon_x


class Encoder(nn.Module):

    def __init__(self,
                 layer_sizes,
                 latent_size,
                 num_labels):

        super().__init__()


        self.MLP = nn.Sequential()

        for i, (in_size, out_size) in enumerate(zip(layer_sizes[:-1],
                                                    layer_sizes[1:])):
            print(i, ": ", in_size, out_size)
            self.MLP.add_module(name="L{:d}".format(i),
                                module=nn.Linear(in_size, out_size))
            # i runs from 0 to len(layer_sizes) - 2, so this condition is always
            # true: a ReLU follows every hidden linear layer of the encoder.
            if i != len(layer_sizes):
                print("ReLU added @ Encoder")
                self.MLP.add_module(name="A{:d}".format(i),
                                    module=nn.ReLU())
                # self.MLP.add_module(name="BN{:d}".format(i),
                #                     module=nn.BatchNorm1d(out_size))

        self.linear = nn.Linear(layer_sizes[-1], latent_size)

    def forward(self, x, c=None):


        x = self.MLP(x)

        z = self.linear(x)

        return z


class Decoder(nn.Module):

    def __init__(self,
                 layer_sizes,
                 latent_size,
                 num_labels):

        super().__init__()

        self.MLP = nn.Sequential()
        input_size = latent_size

        for i, (in_size, out_size) in enumerate(
                zip([input_size]+layer_sizes[:-1], layer_sizes)):
            print(i, ": ", in_size, out_size)
            self.MLP.add_module(
                name="L{:d}".format(i), module=nn.Linear(in_size, out_size))
            if i+1 < len(layer_sizes):
                # Note: no ReLU after the very first decoder layer (i == 0), so the
                # latent code first passes through a purely linear transformation.
                if i != 0:
                    print("ReLU added @ Decoder")
                    self.MLP.add_module(name="A{:d}".format(i), module=nn.ReLU())
                    # self.MLP.add_module(name="BN{:d}".format(i),
                    #                     module=nn.BatchNorm1d(out_size))

            else:
                print("Sig step")
                self.MLP.add_module(name="sigmoid", module=nn.Sigmoid())

    def forward(self, z, c=None):

        x = self.MLP(z)

        return x

This is the model code I use. If I pass [784, 256, 256] as the variable `layer_sizes`, the model builds the encoder and decoder symmetrically, with a ReLU between the linear transformations of the given input/output sizes.

I have tried many values of `layer_sizes`; the experiment log is attached below for reference.
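For context, the numbers in the log come from a training loop roughly like the minimal sketch below; the optimizer, learning rate, batch size, and the per-sample summed binary cross entropy are assumptions of this sketch rather than guaranteed run settings:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Symmetric encoder/decoder built from layer_sizes = [784, 256, 256].
    layer_sizes = [784, 256, 256]
    model = AE(encoder_layer_sizes=layer_sizes,
               latent_size=64,
               decoder_layer_sizes=list(reversed(layer_sizes))).to(device)

    train_loader = DataLoader(
        datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=64, shuffle=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    bce = nn.BCELoss(reduction="sum")  # binary cross entropy between recon_x and x

    for epoch in range(10):
        for step, (x, _) in enumerate(train_loader):
            x = x.to(device).view(-1, 28 * 28)
            recon_x, z = model(x)
            loss = bce(recon_x, x) / x.size(0)  # summed BCE, averaged over the batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print("Epoch {:02d}/10 Batch {:04d}/937, Loss {:9.4f}".format(
            epoch, step, loss.item()))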

## Goal of the Project
The goal of this project is to find a way to determine the `optimal number of latent dimensions`.

First, the project introduces linearity and non-linearity and postulates that a linear map corresponds to `one` dimension, and that this one dimension can be split into `two` non-overlapping dimensions by a single ReLU non-linearity.

Therefore, this project argues that the optimal number of latent dimensions preliminarily does `not depend on the data distribution itself`, but on `the network structure`; more specifically, on the `total number of dimensions the model is able to express`. We will call this total number of dimensions the **model dimension**.

Once the model dimension is set, one can train the network and check whether it is possible to over-fit it on the given data. If the data points are over-fit at some point during training, the network can be considered "expressive enough for the data distribution". If it does not over-fit, one can enlarge the **model dimension** and retry the over-fitting process.
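Under this assumption, the **model dimension** can be read directly off the encoder widths. A possible formalization, consistent with the examples used below ([784, 256] with one ReLU giving 512, [784, 32, 32] giving 128), is sketched here; the formula is my own reading of the assumption, not an established result:

    def model_dimension(layer_sizes):
        """Hypothetical model dimension, assuming each ReLU layer splits every
        dimension it acts on into two non-overlapping dimensions."""
        hidden = layer_sizes[1:]                # e.g. [784, 256] -> [256]
        if not hidden:
            return layer_sizes[0]               # no ReLU layer: purely linear
        return hidden[-1] * 2 ** len(hidden)    # 256 * 2 = 512, 32 * 2**2 = 128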

## To-do
Define "over-fit".
The threshold for declaring a run over-fit depends on the experiment.
- At which training epoch should one decide whether the model has over-fit?

## Caution
It is better to use the whole dataset when determining the "model dimension", since the question is how much non-linearity is required for the collected or targeted data domain.

## Convergence Determination Metric
When the epoch-average loss does not change by more than 1 % over 5 epochs (counting from the first epoch), we consider the training loss converged.
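A minimal sketch of this stopping rule (the exact window bookkeeping, e.g. comparing each epoch against the start of the window, is my own assumption):

    def has_converged(epoch_avg_losses, patience=5, tol=0.01):
        """True once the epoch-average loss has changed by no more than 1 %
        over the last `patience` epochs."""
        if len(epoch_avg_losses) < patience + 1:
            return False
        window = epoch_avg_losses[-(patience + 1):]
        ref = window[0]
        return all(abs(loss - ref) / ref <= tol for loss in window[1:])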

## Experiment Workflow

#### Exp_1 : 1 ReLU applied to 256 dimensions (then a linear transformation to LatentDim)

By the assumption, the **model dimension** is 512 (256 * 2). Thus, we verify the assumption by

1) checking that the loss at a fixed training epoch decreases step by step as the LatentDim is sequentially increased

with `1 * (MLP + ReLU) + LatentDim 1` 

    Epoch 09/10 Batch 0937/937, Loss  165.5437

with `1 * (MLP + ReLU) + LatentDim 2` 

    Epoch 09/10 Batch 0937/937, Loss  150.2990

with `1 * (MLP + ReLU) + LatentDim 3` 

    Epoch 09/10 Batch 0937/937, Loss  133.2206

with `1 * (MLP + ReLU) + LatentDim 4` 

    Epoch 09/10 Batch 0937/937, Loss  138.1151

with `1 * (MLP + ReLU) + LatentDim 8` 

    Epoch 09/10 Batch 0937/937, Loss  110.9839

with `1 * (MLP + ReLU) + LatentDim 16` 

    Epoch 09/10 Batch 0937/937, Loss 89.6707

with `1 * (MLP + ReLU) + LatentDim 32` 

    Epoch 09/10 Batch 0937/937, Loss 72.5663

with `1 * (MLP + ReLU) + LatentDim 64` 

    Epoch 09/10 Batch 0937/937, Loss 54.2545

> ... since the model converges at LatentDim 64 with Loss 52, we shrink the ReLU_InputDim down to 32 (go to Exp_3)

with `1 * (MLP + ReLU) + LatentDim 128` 

    Epoch 09/10 Batch 0937/937, Loss   54.3565

with `1 * (MLP + ReLU) + LatentDim 256` 

    Epoch 09/10 Batch 0937/937, Loss   52.3050

> ... the loss must keep decreasing; write code to automate this sweep (see the sketch after Exp_1)

with `1 * (MLP + ReLU) + LatentDim 512` 

    Epoch 09/10 Batch 0937/937, Loss   53.2412

> ... Check whether, for any LatentDim > 512, the loss stops decreasing at the fixed training epoch.


with `1 * (MLP + ReLU) + LatentDim 1024` 

    Epoch 09/10 Batch 0937/937, Loss   54.3255

> As you can see, even after the LatentDim is `doubled`, the loss at the fixed step still does not decrease,
which means the model dimension is already saturated.
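As flagged in the note above, this sweep over LatentDim is worth automating. A rough sketch follows; `train_autoencoder` is a hypothetical helper standing in for the training loop shown earlier, returning the loss at the fixed epoch:

    def sweep_latent_dim(layer_sizes, start_dim=1, max_dim=1024, epochs=10):
        """Double LatentDim until the loss at a fixed training epoch stops improving."""
        results, best, dim = {}, float("inf"), start_dim
        while dim <= max_dim:
            # train_autoencoder: hypothetical helper that builds AE(layer_sizes, dim, ...),
            # trains for `epochs` epochs and returns the final epoch-average loss.
            loss = train_autoencoder(layer_sizes, latent_dim=dim, epochs=epochs)
            results[dim] = loss
            if loss >= best:          # no improvement at the fixed epoch -> saturated
                break
            best, dim = loss, dim * 2
        return results
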
#### Exp_2: Now introduce twice as much model dimension via ReLU

with `2 * (MLP + ReLU) + LatentDim 1024`

    Epoch 09/10 Batch 0937/937, Loss   57.9039


(Without bias, the sequential ReLUs don't work.)


#### Exp_3 : Shrink the ReLU input dimension down to 32 while keeping LatentDim 64

### Summary of Algorithm

    If convergeLoss != 0:
        if modelDim > latentDim:
            enlarge latentDim
        if modelDim <= latentDim:
            increase #ReLUs

    * modelDim = 2 * num_ReLUs
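The same rule as a code sketch (the doubling step, the tolerance, and the way a ReLU layer is added are placeholders, not values taken from the experiments):

    def adjust_architecture(converge_loss, model_dim, latent_dim, layer_sizes, tol=1e-3):
        """One step of the rule above: grow latentDim, or add ReLU capacity,
        until the converged loss is (approximately) zero."""
        if converge_loss <= tol:
            return layer_sizes, latent_dim                 # over-fit achieved, stop
        if model_dim > latent_dim:
            latent_dim *= 2                                # enlarge latentDim
        else:
            layer_sizes = layer_sizes + [layer_sizes[-1]]  # add one more ReLU layer
        return layer_sizes, latent_dim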

To verify this, 

@ exp latentDim 64, convergeLoss 80, layerSize [784, 32]: if one increases the latentDim, the convergeLoss should not drop below 80.

Let's check!
@ exp latentDim 128, layerSize [784, 32]: convergeLoss 80 (unchanged)

Now stack a second ReLU layer, [784, 32, 32], which presumably represents 128 dimensions:
@ exp latentDim 128, layerSize [784, 32, 32]: convergeLoss 80 (still the same)

As you can see, without enlarging the foremost dimension, the deeper ReLUs do not help. This is consistent with Raghu (2017).

Now make it wider, e.g. [784, 64]:
@ exp_1555829642 latentDim 128, layerSize [784, 64]: convergeLoss 65 < 80

Make it wider still, [784, 128]:
@ exp_1555829642 latentDim 128, layerSize [784, 128]: convergeLoss 55 < 80

And wider again, [784, 256]:
@ exp_1555832143 latentDim 128, layerSize [784, 256]: convergeLoss 55 = 55

Is latentDim the bottleneck? Make sure the latentDim is sufficient:
@ exp_1555832638 latentDim 256, layerSize [784, 256]: convergeLoss 55 = 55

===> Question! How can one determine latentDim with less effort, without going through this cumbersome experimental procedure?

Checking again whether latentDim is sufficient:

@ exp_1555832638 latentDim 128, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 65 > 55

@ exp_1555832638 latentDim 256, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 68 > 55

@ exp_1555832638 latentDim 64, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 68 > 55

@ exp_1555832638 latentDim 128, convergeLoss 60, layerSize [784, 256, 128]: the convergeLoss 60 > 55

@ exp_1555834546 latentDim 64, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 55 = 55

=====> Decreasing the latentDim makes the model learn better (Q1)

@ exp_1555834546 latentDim 32, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 60 > 55


The settings that currently reach convergeLoss 55 are:

    [784, 128], ld 128
    [784, 128], ld 256
    [784, 256, 256], ld 64

@ 1555843696, ld 64, [784, 128, 128]: convergeLoss 60 > 55
@ 1555844254, ld 128, [784, 128, 128]: convergeLoss 64 > 55
@ 1555844254, ld 32, [784, 128, 128]: convergeLoss 66 > 55


I don't know why, but when the network is deeper, too large a latent space decreases the learning efficiency (Q1).




@ exp_1555832638 latentDim 32, convergeLoss 65, layerSize [784, 256, 256]: the convergeLoss 55 = 55


Maybe, if the modelDim is too big and the latentDim is too small, as seen in the [784, 32, 32] experiment,
training might not work at all. Thus, we raised the latentDim in that same setting from 128 to 256:
@ exp_1555830495 convergeLoss 80 (still the same)

If anyone has succeeded, or has seen reproducible code/reports that successfully learn a strict identity mapping on MNIST with an autoencoder structure, please let me know!

0 Answers:

No answers yet.