另外,如何评价 AdamW优化器?
另外,如何评价gelu激活函数?
如何评价 new gelu激活函数呢?
On LAMB
At first, they thought all gradients are equal, so SGD should work for everything.
Then they realize some gradients are different, and there needs to be some sort of adaptation, like rmsprop/adagrad/adam.
After some more time, they realize that the variation in gradients cannot simply be characterized by a scalar/ bunch of scalars. the degree of adaptation needs to catch up with the degrees of variation. more sophisticated adaptation schemes were developed: normalization, feedback control, and so on.
If we go down this path we're likely to end up with network topologies where feedback/normalization mechanisms are distributed among the massive number of weights, each taking care of the few weights around it. much like a mammal brain.