Five days ago, Grant Sanderson, a.k.a. 3Blue1Brown, started uploading a series of videos on how today’s chatbots such as ChatGPT or Llama work. First, he uploaded an explainer of what GPT is and how the model’s broad architecture is laid out, before releasing a video today that focuses specifically on the attention mechanism.
This is a perfect opportunity to do something I have been meaning to do for a long time but apparently forgot about until now. Specifically, two years ago I created a large poster, now hanging next to my desk, in which I tried to explain to myself how a transformer works. Many people have attempted to visualize the transformer architecture, but they often left out a lot of detail that I personally would’ve loved to include. Having watched 3Blue1Brown’s videos, I am delighted that Sanderson chose many of the same visual metaphors for his explainer, so if you’ve already seen his videos, the poster should look very familiar.
There is one large difference, however: In his most recent video, Sanderson mentions that he will focus only on generative transformers, not the “original” translation transformer from the 2017 paper “Attention is all you need”. In other words, 3Blue1Brown covers only the decoder-only architecture of GPT, not the encoder/decoder architecture that is used for translation.1 And that encoder/decoder architecture is precisely what my poster focuses on. I meant to upload it ages ago, so it’s a happy coincidence that his videos reminded me of it.
Thus, without too much additional noise (because I did produce a very long text two days ago), here’s the poster for you!