Limitations of the Empirical Fisher Approximation
- Frederik Künstner (École Polytechnique Fédérale de Lausanne)
Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, has recently received attention as a way to capture partial second-order information. Several works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We caution against this argument by discussing the limitations of the empirical Fisher, showing that, unlike the Fisher, it does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that the pathologies of the empirical Fisher can have undesirable effects. This leaves open the question of why methods based on the empirical Fisher have been shown to outperform gradient descent in some settings. As a step towards understanding this effect, we show that methods based on the empirical Fisher can be interpreted as a way to adapt the descent direction to the variance of the gradients.
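To make the distinction concrete, here is a minimal sketch (not the authors' code) that assumes a toy linear regression model with unit-variance Gaussian noise. The data and all names (`X`, `theta`, `fisher`, `empirical_fisher`) are hypothetical and chosen only for illustration. Under this assumption the Fisher takes the expectation over labels drawn from the model and coincides with the Hessian of the negative log-likelihood, whereas the empirical Fisher uses the observed labels and therefore weights each outer product by a squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = x^T theta_true + unit-variance Gaussian noise.
N, D = 200, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=N)


def per_example_grads(theta):
    # Gradient of -log p(y_n | x_n, theta) = 0.5 * (x_n^T theta - y_n)^2 + const.
    residuals = X @ theta - y
    return residuals[:, None] * X  # shape (N, D)


def fisher(theta):
    # Sum over n of E_{y ~ p(y | x_n, theta)}[g g^T]. For unit-variance
    # Gaussian noise this is sum_n x_n x_n^T, independent of theta.
    return X.T @ X


def empirical_fisher(theta):
    # Sum over n of g_n g_n^T, using the observed labels y_n.
    G = per_example_grads(theta)
    return G.T @ G


def hessian(theta):
    # Hessian of the summed negative log-likelihood for this model.
    return X.T @ X


theta = np.zeros(D)
grad = per_example_grads(theta).sum(axis=0)

# One preconditioned step with each matrix (step size 1).
theta_ngd = theta - np.linalg.solve(fisher(theta), grad)
theta_ef = theta - np.linalg.solve(empirical_fisher(theta), grad)

# Reference: the least-squares minimizer of the loss.
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
```

In this quadratic problem, the step preconditioned by the Fisher lands on the least-squares solution `theta_star`, because the Fisher equals the Hessian. The step preconditioned by the empirical Fisher generally does not, since its curvature estimate depends on the residuals at the current `theta` rather than on the model's second-order structure.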