I wonder what would happen to this analysis if a momentum term were added to the gradient descent. It seems like it would fix the specific failure modes in the examples, but I wonder whether there's a corresponding mathematical way of categorizing which kinds of functions can (or cannot) be optimized quickly with GD + momentum.
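To make the momentum question concrete, here is a minimal sketch comparing plain GD with heavy-ball momentum. The test function is a toy ill-conditioned quadratic of my own choosing, not one of the article's examples, and the function names, step size, and `beta = 0.9` are all assumptions picked for illustration.

```python
# Toy objective (an assumption, not from the article):
# f(x, y) = 0.5 * (x^2 + 100 * y^2), condition number 100.

def loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)

def grad(x, y):
    return x, 100.0 * y

def gradient_descent(steps, lr):
    """Plain GD from (1, 1); returns the final loss."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return loss(x, y)

def heavy_ball(steps, lr, beta):
    """GD with heavy-ball momentum from (1, 1); returns the final loss."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0  # velocity starts at zero
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity keeps a decaying memory of past gradient steps.
        vx, vy = beta * vx - lr * gx, beta * vy - lr * gy
        x, y = x + vx, y + vy
    return loss(x, y)

if __name__ == "__main__":
    # lr must stay below 2 / 100 for plain GD to be stable on this quadratic.
    gd_loss = gradient_descent(steps=300, lr=0.018)
    hb_loss = heavy_ball(steps=300, lr=0.018, beta=0.9)
    print(f"plain GD loss:   {gd_loss:.3e}")
    print(f"heavy-ball loss: {hb_loss:.3e}")
```

On this quadratic the momentum run should end with a much smaller loss at the same step size, because the slow, low-curvature direction is no longer limited by the step size that the steep direction forces. That only covers ill-conditioning, though; it says nothing about the harder question of which function classes stay slow even with momentum.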
The neural net answer is being able to spawn a wavelet at any position, as opposed to tweaking the position of an existing one.
What's the question this method is attempting to answer? What does an answer look like? How does this method lead to it?
> If you have ever tried to minimize a function with gradient descent
"and if otherwise, go kick sand," I guess.