If I understand correctly, when training a deep learning model with mini-batches, every mini-batch gets a forward and a backward pass, followed by a weight update from the corresponding optimizer step. But does something different happen at the end of an epoch (after all mini-batches have been used)?
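To make sure we are talking about the same thing, this is my mental model of a single mini-batch step (a minimal sketch with a tiny placeholder model, not my actual U-Net):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # placeholder model, not the U-Net
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)   # also tried Adam
loss_fn = tf.keras.losses.BinaryCrossentropy()

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)            # forward pass
        loss = loss_fn(y_batch, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)            # backward pass
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # weight update
    return loss
```

As far as I understand, this is all that happens for every mini-batch, with nothing extra at the epoch boundary.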
The reason I'm asking is that in my implementation of a U-Net for image segmentation, the loss decreases slightly with every mini-batch (on the order of 0.01). Then, when a new epoch starts, the loss of the first mini-batch differs a lot (on the order of 0.5) from that of the last mini-batch of the previous epoch. Also, after the first epoch, the loss on the test data is close to the loss of the first mini-batch of the next epoch.
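For reference, this is roughly how I am reading the per-batch and per-epoch losses (a simplified sketch, not my exact code; the values I quote above are the ones Keras reports):

```python
import tensorflow as tf

class BatchLossLogger(tf.keras.callbacks.Callback):
    """Prints the loss Keras reports after every training batch and every epoch."""

    def on_train_batch_end(self, batch, logs=None):
        print(f"batch {batch}: loss = {logs['loss']:.4f}")

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: loss = {logs['loss']:.4f}, "
              f"val_loss = {logs.get('val_loss')}")
```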
I would interpret this as the weights being updated more strongly at the end of an epoch than between mini-batches, but I have found no theory supporting this. I would appreciate an explanation.
As for the optimizer, this is happening both with stochastic gradient descent and with Adam. If it helps, I am using Keras.
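For completeness, this is roughly how the model is compiled and trained (a simplified sketch; `build_unet`, the loss, and the hyperparameters are placeholders, not my exact setup):

```python
import tensorflow as tf

model = build_unet()  # placeholder for my U-Net builder

# Also tried tf.keras.optimizers.SGD(learning_rate=0.01); same behaviour.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy")

history = model.fit(x_train, y_train,            # placeholder arrays of images and masks
                    batch_size=16,
                    epochs=10,
                    validation_data=(x_test, y_test))
```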