mathe online provides a gallery of multimedia mathematics learning tools for school, university, and self-study (in German).
For the first 50k steps of training, the loss is quite stable and low, and then it suddenly starts to explode exponentially. I wonder how this can happen. Of course, there are many reasons a loss can increase, such as a learning rate that is too high. But what I do not understand is the following:

Moreover, I also noticed that training on CPU is slowed down; perhaps the two things are correlated(?). I originally encountered this problem with an installation of PyTorch 2.2 and CUDA 12.1. Thinking it was due to these versions, I re-ran a clean install of both, but the same problem also occurs with versions 2.4 and 12.4.

If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or, worse, toward incidental features that aren't truly related to the topic at all. Warm-up is a way to reduce the primacy effect of the early training examples. Without it you may ...
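As a rough illustration of warm-up, here is a minimal sketch of a linear schedule that ramps the learning rate up over the first steps and then holds it constant. The function name `warmup_lr` and the default values for `base_lr` and `warmup_steps` are assumptions made for this example, not something taken from a particular library; real schedules often add a decay phase afterwards.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Return the learning rate for a 0-indexed training step.

    The rate grows linearly from base_lr / warmup_steps up to base_lr
    during the first `warmup_steps` steps, then stays at base_lr.
    All names and defaults here are illustrative assumptions.
    """
    if warmup_steps <= 0:
        # No warm-up requested: use the base rate from the first step.
        return base_lr
    # Fraction of the warm-up completed, capped at 1.0 once it is over.
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale


# Small updates early on limit how far the first batches can pull the weights.
for step in (0, 4, 9, 50):
    print(step, warmup_lr(step, base_lr=1.0, warmup_steps=10))
```

With a schedule like this, the earliest batches contribute only small updates, so an unlucky cluster of similar examples at the start has less influence on where the weights end up.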

The fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.
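The behaviour described above can be sketched in plain Python. This is only an illustration of "hold out the last fraction before shuffling", not the library's actual implementation; the helper name `split_last_fraction` is hypothetical.

```python
def split_last_fraction(x, y, validation_split):
    """Hold out the last `validation_split` fraction of (x, y) for validation.

    Hypothetical helper for illustration: the split is taken from the end
    of the data as given, before any shuffling is applied.
    """
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    # Index of the first validation sample.
    split_at = int(len(x) * (1.0 - validation_split))
    train = (x[:split_at], y[:split_at])
    val = (x[split_at:], y[split_at:])
    return train, val


x = list(range(8))
y = [v * 2 for v in x]
(train_x, train_y), (val_x, val_y) = split_last_fraction(x, y, 0.25)
print(train_x, val_x)
```

One consequence worth keeping in mind: if the data is ordered (for example, sorted by class or by time), the held-out tail may not be representative, so shuffling the data yourself before passing it in can matter.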

However, a large difference in performance between the training set and the test set can indicate overfitting. In your case, the difference in accuracy between the training set and the test set (99% vs 96%) is not very large, so it is highly unlikely that the model is severely overfitting. However, it is still a good idea to check for other ...
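The comparison above is simple arithmetic, sketched here as a small check. The function names and the `max_gap` threshold of 0.10 are assumptions chosen for illustration; there is no universal cutoff, and what counts as a large gap depends on the task.

```python
def accuracy_gap(train_acc, test_acc):
    """Return how much higher the training accuracy is than the test accuracy."""
    return train_acc - test_acc


def looks_overfit(train_acc, test_acc, max_gap=0.10):
    """Flag a possibly overfit model when the gap exceeds max_gap.

    max_gap is a hypothetical, task-dependent threshold, not a standard value.
    """
    return accuracy_gap(train_acc, test_acc) > max_gap


# The figures from the text: 99% training accuracy vs 96% test accuracy.
print(round(accuracy_gap(0.99, 0.96), 4))
print(looks_overfit(0.99, 0.96))
```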

