A New Theory Of AI
Imagine you want to predict the generalization quality of some (pre)trained DNN model you have, like GPT, FlanT5, or Bloom. How can this be done? And suppose you don’t have access to the test data, or even the training data. Moreover, you really want to understand how the individual layers are performing.
Recently, it has been shown that, using the open-source weightwatcher tool, one can “Predict [the] trends in the quality of state-of-the-art neural networks without access to training or testing data” (published in Nature Communications!). That sounds amazing, but how could this possibly work?
Here, we show why weightwatcher works.
Let’s write the quality Q of a DNN model as a sum of contributions Qt, one from each layer.
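In symbols, the per-layer decomposition is simply:

```latex
Q(\text{model}) \;=\; \sum_{t \,\in\, \text{layers}} Q_t
```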
1) We can formulate the layer quality as a matrix generalization of the old student-teacher model from statistical mechanics for the free energy of a linear perceptron (“Statistical mechanics of learning from examples”, Phys. Rev. A 45, 6056 (1992)).
Call the model you have the ‘Teacher’ (T), and imagine training all possible Students (S) that can mimic the Teacher. The layer quality (or free energy) Qt is an integral of the Student-Teacher overlap, taken over all student matrices that resemble the teacher; that is, over all random matrices S that ‘look like’ the Teacher’s weight matrix.
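Schematically, this is a free-energy-like average over students, weighted by their overlap with the Teacher (the notation here is illustrative, not the exact form from the papers):

```latex
Q_T \;\sim\; \frac{1}{N}\,\log \int \! d\mu(S)\; e^{\,\beta N \,\mathrm{Tr}\left[ S^{\top} T \right]}
```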
OK, so how do we compute an integral over random matrices?
Not easy; however, such integrals are not unknown to quants, thanks to recent advances in Random Matrix Theory (RMT).
2) Apply the Annealed Approximation (via Jensen’s inequality) to re-write Qt as something like a free energy.
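The annealed approximation replaces the quenched average of the log partition function with the log of its average; by Jensen’s inequality, this gives an upper bound:

```latex
\mathbb{E}\left[ \log Z \right] \;\le\; \log \mathbb{E}\left[ Z \right]
```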
3) Assume most of the generalizing components come from the largest eigenvalues, i.e., the tail of the ESD (empirical spectral density). Apply a volume-preserving transformation to convert Qt into an HCIZ integral; that is, a random matrix integral we know how to evaluate.
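For reference, the Harish-Chandra-Itzykson-Zuber (HCIZ) integral averages over the unitary group and depends on the fixed matrices A and B only through their eigenvalues:

```latex
I(A, B) \;=\; \int_{U(N)} e^{\,N\,\mathrm{Tr}\left[ A\, U B\, U^{\dagger} \right]}\, dU
```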
And there is good experimental evidence to show this transformation is reasonable.
Once done, Qt can be evaluated semi-empirically (i.e., phenomenologically) using an idea suggested in the theoretical physics literature by Galluccio, Bouchaud & Potters (“Rational Decisions, Random Matrices and Spin Glasses”, arXiv:cond-mat/9801209), and later worked out formally by Tanaka (iop.org).
Moreover, in the large-N limit, the expected value of the HCIZ integral (above) is a sum of ‘generalized norms’ G that depend only on the eigenvalues of the actual Teacher layer weight matrix W = T.
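Schematically (this is the shape of Tanaka’s large-N result, written in illustrative notation rather than the exact statement), each tail eigenvalue of W contributes a generalized norm built from an R-transform:

```latex
\mathbb{E}\left[ Q_T \right] \;\approx\; \sum_{\lambda_i \,\in\, \text{tail}} G(\lambda_i),
\qquad
G(\lambda) \;=\; \int_0^{\lambda} R(z)\, dz
```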
Because the weight matrices W are strongly correlated, heavy-tailed objects (similar to the covariance matrices in finance), we can fit the distribution of their eigenvalues to a power law with exponent alpha. That is, the tail of the ESD satisfies ρ(λ) ~ λ^(-α).
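Here is a minimal sketch of such a fit, using only the Python standard library and synthetic data (weightwatcher itself uses a more careful fit, searching over xmin with the powerlaw package; the estimator and the synthetic eigenvalues below are illustrative assumptions):

```python
import math
import random

def fit_powerlaw_alpha(samples, xmin=1.0):
    """Continuous maximum-likelihood estimate of alpha for rho(x) ~ x^(-alpha), x >= xmin."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Synthetic stand-in for a layer's tail eigenvalues:
# Pareto(shape=1.5) has density ~ x^(-2.5), so the true alpha is 2.5.
random.seed(0)
eigs = [random.paretovariate(1.5) for _ in range(100_000)]

print(round(fit_powerlaw_alpha(eigs), 2))  # close to 2.5
```

In practice the eigenvalues would come from the ESD of an actual layer weight matrix (e.g., the squared singular values of W), not from a synthetic Pareto draw.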
That’s all there is to it!
The weightwatcher alpha-hat metric is basically the (conditional) likelihood, or free energy, of that specific layer generalizing well.
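In the weightwatcher papers, this metric combines the fitted power-law exponent with the layer's largest eigenvalue. A minimal sketch (the function name is mine; see the weightwatcher docs for the exact definition and scaling):

```python
import math

def alpha_hat(alpha, lambda_max):
    """Weighted-alpha ('alpha-hat') style layer metric: the fitted power-law
    exponent alpha scaled by the log of the largest ESD eigenvalue."""
    return alpha * math.log10(lambda_max)

# Example: a layer with a well-fit tail (alpha ~ 2.5) and spectral norm ~ 50
print(alpha_hat(2.5, 50.0))  # ≈ 4.25
```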
Hope you enjoyed nerding out on some mathematical physics.
For more details, check out the weightwatcher.ai project: https://weightwatcher.ai
And lastly, if you need help with your AI projects, reach out to Dr. Charles Martin, who has been featured here previously on Rebellion Research!
I do machine learning, deep learning, data science, and AI software development, with extensive domain experience in NLP for Search Relevance (as well as Text Generation and Quantitative Finance).
I have personally developed machine learning (ML) systems and helped get them into production at companies including Roche, France Telecom, GoDaddy, Aardvark (Google), eBay, eHow, Walmart, Barclays/BGI, and Blackrock. Recently I was both a consultant and FTE distinguished engineer at GLG, a very prestigious international consulting firm, where I developed AI methods for the search and recommendations platform.
I provide scientific consulting to the Page family office at the Anthropocene Institute, advising on areas of nuclear and quantum technologies with an eye toward climate change.
I do scientific research in collaboration with UC Berkeley on the foundations of AI and am the lead on the weightwatcher project: https://weightwatcher.ai
In 2011, I helped Demand Media / eHow become the first $1B IPO since Google.
I was at Aardvark (acquired by Google and featured in The Lean Startup), and a subject matter expert (SME) in ML at eBay.
At my first startup, in the late 90s, I developed semi-supervised ML algorithms for personalized search.
I offer over 15 years of commercial data science, software engineering, and ML experience. I am a full-stack developer for web, object-oriented, and numerical programming, including Java, Ruby, Python, and even DevOps.
In addition, I was coding on the Cray X-MP in high school. Furthermore, I invented a Monte Carlo method for non-equilibrium condensed matter systems and published it at 19. I am a national math contest winner.
My PhD is in Theoretical Chemistry from U Chicago. I was an NSF Fellow (1 of 2 nationwide).
Ask me ML questions on Quora.
Set up a private call on Clarify.fm
Specialties: Computational Advertising, Deep Learning, Machine Learning, Data Science, Mathematical Modeling, Search Relevance, Predictive Analytics, NLP, Time Series Analysis, Quant Strategies, Quantum Chemistry
Java, Python, Ruby, C, Fortran
SQL, NoSQL, Hadoop, Redis, AWS, Chef, EC2