One of the insights the public could gain about deep learning methods over the past years is that the bigger those models get, the better they become. What every scientist in their right mind would call bogus, and what would be considered p-hacking if it were about dataset sizes, has become a staple of machine learning at least since the inception of the transformer model a few years ago (Vaswani et al. 2017): Making your models bigger will automatically make them better.
Don’t get me wrong, there are enough voices out there focusing on the problems of that strategy (e.g. Bender et al. 2021; Whittaker 2021), but since 2017, every iteration of deep learning models has become both bigger and better. From BERT to GPT-3, it was all about ever larger parameter counts.
However, there are a few indications that this trend might be supplanted by more reasonable approaches, such as “Gato” (Reed et al. 2022), a model that apparently is capable of fulfilling multiple tasks with the same set of weights.
Nevertheless, deep learning still involves a lot of black magic. I myself had such a completely ridiculous experience just yesterday.
I have a model that a colleague and I developed back at the beginning of 2021: a small LSTM-based sentence classifier. Back then this was a huge success: For the first time, I had not only written my very own neural network, but it also worked excellently. We achieved high accuracy scores, it was capable of a lot of inference, and every additional feature we modeled into the LSTM network made it even better.
After having let it rest for the better part of a year, I wanted to present the model alongside a method at a few conferences this year, the first of which is approaching rapidly. So I decided to take another look at the code. Having had to code a lot of things in the past year, I realized that many parts of the model’s code were either inefficient or just plainly inelegant.
So I decided to clean it up. I discovered some weird logical bugs and other problems, and after having fixed them, two things happened: First, the code shrank to half its length (no, this is not a comical story about removing a misbehaving but essential function), and second, the model suddenly was one of the worst I’ve seen.
I did the only reasonable thing I could do: I panicked. And then took a very long look at my code again. I fixed a few newly introduced bugs, none of which affected the model performance, however.
Then, over the weekend, I decided to go full geronimo: Since we all know that more is better in deep learning, I decided to build another classifier to pre-train some part of our model. It took me about four hours (luckily I had the skeleton for that classifier already lying around) and then I re-ran my actual classifier, utilizing the pretrained weights.
And, lo and behold: The classifier did what it was supposed to do and what it did a year ago.
I was baffled.
Why does a classifier that apparently only worked well by accident in the first place suddenly become better if you literally stack just another classifier on top of it?!
This must be witchcraft. In general (and I believe computer science graduates will probably see this differently), deep learning is more of an art than a science, and behind all that trial & error you have to get through to end up with a decent classifier, there is a lot of “just make it bigger”.
Of course it’s not at all witchcraft. But why does a second classifier on top of one suddenly make the first one perform much better?
Recently, Microsoft Research published a piece on “neurocompositional computing” (Smolensky et al. 2022). In it, they make one major point: Deep learning classifiers may have gotten better over the past years simply by increasing their size, but the transformer model (Vaswani et al. 2017) already heralded the next stage, which takes a two-pronged approach. Instead of relying on neural networks alone (i.e. a connectionist approach), you combine those neural networks with symbolic approaches. These divide the work of getting from some input to some output among multiple classifiers that each solve one single, simple problem, and whose outputs are then combined to arrive at the solution for a much more complex problem.
Looking at my classifier with this approach in mind, it makes absolute sense why it worked so well: In principle I had two classifiers. The first only had the goal of encoding co-occurrence patterns of grammatical constructions (basically a Word2Vec implementation, just for grammar, see Levy and Goldberg 2014), and the second then just had the task of taking that information and using it to classify sentences.
Both problems are in and of themselves very limited in scope: They both have one very clear goal that is easily expressed in algorithmic terms. Therefore, neural networks can be “chained” together. But in their sum, they solve a much more complex problem – with better accuracy than if I just used a single classifier for the problem.
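To make the chaining idea concrete, here is a minimal sketch of such a two-stage pipeline. It is not the model described in this post: as a stand-in for Word2Vec-style training it uses plain neighbour counts, and as a stand-in for the LSTM it uses a nearest-centroid classifier. The toy corpus, the tag names, and all function names are made up for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: part-of-speech tag sequences with sentence labels.
TRAIN = [
    (["DET", "NOUN", "VERB", "DET", "NOUN"], "statement"),
    (["PRON", "VERB", "DET", "NOUN"], "statement"),
    (["AUX", "PRON", "VERB", "DET", "NOUN"], "question"),
    (["AUX", "DET", "NOUN", "VERB"], "question"),
]


def train_tag_embeddings(tag_sequences, window=1):
    """Stage 1: represent each tag by the counts of its neighbouring tags."""
    vocab = sorted({tag for tags in tag_sequences for tag in tags})
    counts = {tag: Counter() for tag in vocab}
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            start, stop = max(0, i - window), min(len(tags), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[tag][tags[j]] += 1
    return {tag: [float(counts[tag][ctx]) for ctx in vocab] for tag in vocab}


def sentence_vector(tags, embeddings):
    """Average the frozen stage-1 vectors of all known tags in a sentence."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[tag] for tag in tags if tag in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def train_classifier(examples, embeddings):
    """Stage 2: one centroid per label, computed on top of the embeddings."""
    grouped = {}
    for tags, label in examples:
        grouped.setdefault(label, []).append(sentence_vector(tags, embeddings))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(tags, embeddings, centroids):
    """Pick the label whose centroid is most similar to the sentence vector."""
    vec = sentence_vector(tags, embeddings)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Stage 1 is trained first; stage 2 only ever sees its finished output.
embeddings = train_tag_embeddings([tags for tags, _ in TRAIN])
centroids = train_classifier(TRAIN, embeddings)
print(classify(["AUX", "PRON", "VERB"], embeddings, centroids))
```

The point of the sketch is the hand-off: the second stage never touches raw co-occurrence counts, it only consumes the vectors the first stage produced. With four sentences, the predictions themselves mean very little.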
A colleague of mine recently said that there is currently a paradigm shift happening in AI research from purely connectionist approaches towards an approach that also gets the “losing” side of the AI wars on board: the symbolic approach.
The basic difference is that a connectionist approach tries to chain as many neural layers together as necessary to solve a problem, whereas the symbolic approach trusts in experts to write elaborate “if/else” statements that then help a computer solve complex tasks.
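A toy contrast may help here. The sketch below solves the same tiny task twice: once with hand-written rules, and once with a single artificial neuron (a perceptron) that learns its weights from labelled examples. The task (telling questions from statements), the two features, and all names are assumptions chosen for illustration; a real connectionist model would of course chain many layers rather than use one neuron.

```python
WH_WORDS = {"who", "what", "when", "where", "why", "how"}


def featurize(sentence):
    """Two hand-picked binary features: ends with '?', starts with a wh-word."""
    words = sentence.split()
    first = words[0].lower() if words else ""
    return [int(sentence.strip().endswith("?")), int(first in WH_WORDS)]


def rule_based_is_question(sentence):
    """Symbolic: an expert spells the decision out as explicit if/else rules."""
    ends_with_qmark, starts_with_wh = featurize(sentence)
    if ends_with_qmark:
        return True
    elif starts_with_wh:
        return True
    else:
        return False


def train_perceptron(examples, epochs=10):
    """Connectionist: a single neuron learns its weights from labelled data."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            x = featurize(sentence)
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # Classic perceptron update: only wrong predictions move weights.
            error = int(label) - int(score > 0)
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


def learned_is_question(sentence, weights, bias):
    """Apply the learned weights; nobody wrote the decision rule by hand."""
    x = featurize(sentence)
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0


# Hypothetical labelled examples covering all four feature combinations.
EXAMPLES = [
    ("The model works.", False),
    ("Why does it work", True),
    ("Does it work?", True),
    ("Why does it work?", True),
]
weights, bias = train_perceptron(EXAMPLES)

for s in ["Is this witchcraft?", "This is witchcraft."]:
    print(s, rule_based_is_question(s), learned_is_question(s, weights, bias))
```

Both versions end up making the same decisions on this data. The difference lies in where the knowledge lives: in readable rules in the first case, in a handful of numbers in the second.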
In my case, I was the expert who knew that embeddings need to encode co-occurrence patterns (Eisenstein 2018) and who wrote one connectionist neural network to solve that partial problem. Then, I took those pretrained embeddings and stuffed them into another neural network, to which I gave a different task: To use the co-occurrence patterns of both words and grammar to classify sentences. And it performed very well.
I think, given that recent piece of research on the neurocompositional approach, this success makes absolute sense, even if we are apparently no longer able to describe why a computer performs well.
That is both witchcraft and solid computer science.
And that is why I love my job.
References
- Bender, E. M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922
- Eisenstein, J. (2018). Natural Language Processing.
- Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–308. https://doi.org/10.3115/v1/P14-2050
- Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., & de Freitas, N. (2022). A Generalist Agent (arXiv:2205.06175). arXiv. http://arxiv.org/abs/2205.06175
- Smolensky, P., McCoy, R. T., Fernandez, R., Goldrick, M., & Gao, J. (2022). Neurocompositional computing in human and machine intelligence: A tutorial (p. 78).
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need (arXiv:1706.03762). arXiv. http://arxiv.org/abs/1706.03762
- Whittaker, M. (2021). The steep cost of capture. Interactions, 28(6), 50–55. https://doi.org/10.1145/3488666