Adding the GradientClip callback clips the norm of the gradients to at most
max_norm (default: 1) using torch::nn_utils_clip_grad_norm_(). The order of the
norm is set by norm_type (default: 2). Clipping can help prevent the loss from
diverging.
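
As a rough sketch of what the clipping does, assuming
torch::nn_utils_clip_grad_norm_() mirrors PyTorch's clip_grad_norm_: a single
norm is computed over the gradients of all parameters taken together, and every
gradient is rescaled by the same factor only when that norm exceeds max_norm.
The symbols g_i (gradient of the i-th parameter), p (norm_type), c (scaling
factor) and \epsilon (a small constant for numerical stability) are notation
introduced here for illustration.

```latex
\[
  \lVert g \rVert_p = \Bigl( \sum_i \lVert g_i \rVert_p^{\,p} \Bigr)^{1/p},
  \qquad
  c = \min\!\Bigl( 1,\ \frac{\texttt{max\_norm}}{\lVert g \rVert_p + \epsilon} \Bigr),
  \qquad
  g_i \leftarrow c \, g_i .
\]
```

For example, with the defaults (p = 2, max_norm = 1), gradients whose combined
2-norm is 4 would be scaled by roughly 1/4, while gradients whose combined norm
is 0.5 would be left unchanged.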
References
See the FastAI documentation for the GradientClip callback.