Why does Deep Learning work?

Deep Learning is one of the hottest fields of Machine Learning currently.

“[The] long-term behavior of certain neural network models are governed by the statistical mechanics of infinite-range Ising spin-glass Hamiltonians” [1].

So, are multilayer neural networks just spin glasses? Interestingly, the answer depends on the definition of a spin glass. 
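To make "infinite-range Ising spin glass" concrete, here is a minimal sketch. It assumes the standard Sherrington-Kirkpatrick form, which this article does not spell out: every spin is coupled to every other spin through a random Gaussian coupling. The names random_couplings and energy, and the choice of 8 spins, are illustrative only.

```python
import itertools
import math
import random

def random_couplings(n, rng):
    """Symmetric Gaussian couplings J[i][j] between every pair of spins."""
    J = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        J[i][j] = J[j][i] = rng.gauss(0.0, 1.0)
    return J

def energy(spins, J):
    """H(s) = -(1/sqrt(N)) * sum over pairs i<j of J[i][j] * s_i * s_j."""
    n = len(spins)
    pair_sum = sum(J[i][j] * spins[i] * spins[j]
                   for i, j in itertools.combinations(range(n), 2))
    return -pair_sum / math.sqrt(n)

rng = random.Random(0)
n_spins = 8  # assumption: small enough to enumerate all 2**8 configurations
J = random_couplings(n_spins, rng)

# Brute-force search for the lowest-energy spin configuration.
all_states = list(itertools.product((-1, 1), repeat=n_spins))
ground_state = min(all_states, key=lambda s: energy(s, J))
print(ground_state, energy(ground_state, J))
```

Brute-force enumeration only works for a handful of spins. Because the couplings have random signs, no configuration can satisfy every pair at once, and that frustration is what makes the landscape of a large spin glass rugged.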

In his paper, NYU Courant Institute professor Yann LeCun attempts to extend our understanding of training neural networks.

His work on deep learning aims to solve the multilayer neural network optimization problem by studying the stochastic gradient descent (SGD) approach, an iterative method for optimizing an objective function with suitable smoothness properties [1].
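As a rough illustration of what SGD does, the sketch below updates a single parameter using the gradient from one training example at a time. The toy least-squares objective, the learning rate, and the function names are assumptions chosen for illustration, not anything taken from [1].

```python
import random

def sgd(grad_fn, w0, data, lr=0.05, epochs=50, seed=0):
    """Minimal SGD: step along the negative gradient of one sample at a time."""
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)  # visit the samples in a fresh random order
        for x, y in samples:
            w -= lr * grad_fn(w, x, y)
    return w

def grad_squared_error(w, x, y):
    """Gradient of (w * x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

# Toy data generated from y = 3 * x, so the best fit is w = 3.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_fit = sgd(grad_squared_error, w0=0.0, data=data)
print(w_fit)
```

This toy objective is convex, so SGD converging here is unremarkable. The open question discussed in this article is why the same procedure works on multilayer networks.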

Furthermore, Professor LeCun claims that “None of these works however make an attempt to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory and thus in this respect our approach is very novel.” 

However, we already have a good idea of what the energy landscape of multiscale spin glass models* looks like, thanks to Peter Wolynes’ early work on theoretical protein folding and more modern work by Ken Dill [2,3,4]. In fact, here is a typical surface:

Energy Landscape of MultiScale Spin Glass Model

The nodes above represent partially folded states; let us treat them as nodes in a multiscale spin glass or, alternatively, a multilayer neural network.


Immediately, we can observe the analogy and the appearance of the “energy funnel”. In fact, researchers have studied the “folding funnels” of spin glass models for more than 20 years [2,3,4]. In addition, we know that as we increase the network size, the funnel gets sharper.

3D Energy Landscape of the Folding Funnel of a Spin Glass

Note: the Wolynes protein-folding spin-glass model is significantly different from the p-spin Hopfield model that Professor LeCun discusses because it contains multi-scale, multi-spin interactions. 


Spin glasses and spin funnels are quite different. For example, spin glasses are highly non-convex, with lots of local minima, saddle points, etc. Spin funnels, however, are designed to find the spin glass of minimal frustration and have a convex, funnel-shaped energy landscape. Furthermore, they have been characterized and cataloged by theoretical chemists and can take many different convex shapes.

A funnel-shaped landscape seemed to be necessary to resolve one of the great mysteries of protein folding: Levinthal’s paradox [5]. If nature folded a protein by random statistical sampling alone, it would take longer than the “known” lifetime of the universe. This is why machine learning is not just statistics.
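The arithmetic behind Levinthal's paradox is easy to reproduce. The numbers below are commonly used illustrative assumptions, not figures from this article or from [5]: a 100-residue chain, 3 conformations per residue, and a very optimistic sampling rate.

```python
# All values below are illustrative assumptions.
n_residues = 100               # length of a small protein
states_per_residue = 3         # coarse conformations available to each residue
samples_per_second = 1e13      # very optimistic rate of trying conformations
age_of_universe_years = 1.38e10

n_conformations = states_per_residue ** n_residues
seconds_per_year = 365.25 * 24 * 3600
years_to_enumerate = n_conformations / samples_per_second / seconds_per_year

print(f"{n_conformations:.2e} conformations")
print(f"{years_to_enumerate:.2e} years to try them all")
print(years_to_enumerate > age_of_universe_years)
```

Under these assumptions, exhaustive sampling exceeds the age of the universe by many orders of magnitude, while real proteins fold in a tiny fraction of that time. The search therefore cannot be unguided.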

Spin Funnels and Spin Glasses

Spin Funnels (DFM) vs Spin Glasses (SG) [4]

Deep Learning Networks are (probably) Spin Funnels

With a surface like this, it is not so surprising that an SGD method might be able to find the energy minimum (called the native state in protein folding theory). One just needs to descend until the top of the funnel is reached. After that point, the native state is a straight shot down. This, in fact, defines a so-called “folding funnel” [4].

So it is not surprising at all that SGD may work.
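A one-dimensional caricature makes the contrast tangible. The sketch below is a toy under strong assumptions, not a model of a real spin glass or neural network: the "funnel" is a smooth bowl, the "glass" is the same bowl with a large cosine ripple added, and Gaussian noise on the gradient stands in for SGD. All names and parameter values are hypothetical.

```python
import math
import random

def grad(x, roughness, k=5.0):
    """Gradient of E(x) = x**2 + roughness * (1 - cos(k * x))."""
    return 2.0 * x + roughness * k * math.sin(k * x)

def noisy_descent(x0, roughness, rng, lr=0.005, steps=2000, noise=0.1):
    """Gradient descent with Gaussian gradient noise, a crude stand-in for SGD."""
    x = x0
    for _ in range(steps):
        x -= lr * (grad(x, roughness) + rng.gauss(0.0, noise))
    return x

def fraction_reaching_minimum(roughness, n_starts=200, tol=0.1, seed=0):
    """Fraction of random starts in [-4, 4] that end within tol of x = 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_starts):
        x_final = noisy_descent(rng.uniform(-4.0, 4.0), roughness, rng)
        hits += abs(x_final) < tol
    return hits / n_starts

funnel_fraction = fraction_reaching_minimum(roughness=0.0)  # smooth funnel
rugged_fraction = fraction_reaching_minimum(roughness=4.0)  # glassy ripples
print(funnel_fraction, rugged_fraction)
```

On the smooth landscape every start reaches the global minimum at x = 0. On the rippled one most starts get trapped in a nearby local minimum, because the barriers are far larger than the noise.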

Recent research at Google and Stanford confirms that the deep learning energy landscapes appear to be roughly convex [6], as does Professor LeCun’s work on spin glasses.

Note that a real theory of protein folding, which would actually be able to fold a protein correctly (i.e. Freed’s approach [7]), would be a lot more detailed than a simple spin glass model.

Likewise, true deep learning systems are going to have a lot more engineering details to avoid overtraining (dropout, pooling, momentum) than a theoretical spin funnel. Indeed, it is unclear what happens at the bottom of the funnel.

Does the system exhibit a spin glass transition (with full blown replica symmetry breaking, as Professor LeCun suggests), or is the optimal solution more like a simple native state defined by only a few low-energy configurations?

Do deep learning systems display a phase transition, and is it first order like protein folding? We will address these details in a future article. In any case, the problem is not that Deep Learning is non-convex, but rather that we need to avoid overtraining.

Thankfully, it seems modern systems can do this using a few simple tricks, such as rectified linear units and dropout.
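For readers unfamiliar with these two tricks, here is a minimal sketch in plain Python. It assumes the common "inverted dropout" variant, in which surviving units are rescaled during training so that nothing changes at evaluation time. The function names and the sample values are illustrative.

```python
import random

def relu(values):
    """Rectified linear unit: pass positive values through, clamp the rest to zero."""
    return [max(0.0, v) for v in values]

def dropout(values, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return list(values)
    keep = 1.0 - drop_prob
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
pre_activations = [-1.0, 0.5, 2.0, -0.3]  # made-up layer inputs

hidden = relu(pre_activations)
hidden_train = dropout(hidden, drop_prob=0.5, rng=rng)                  # random mask
hidden_eval = dropout(hidden, drop_prob=0.5, rng=rng, training=False)   # unchanged
print(hidden, hidden_train, hidden_eval)
```

Randomly silencing units during training discourages the network from relying on any single unit, which is one way to limit overtraining.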

Hopefully, we can learn something using the techniques developed to study the energy landscape of multi-scale spin glass and spin funnel models, thereby utilizing methods from theoretical chemistry and condensed matter physics [8,9].


Written by Dr. Charles Martin

Edited by Alexander Fleiss, Hantong Wu, Ryan Cunningham, Gihyen Eom & Tianyi Li