Large Language Model Inference with PyTorch on Apple Silicon | Hendrik Erz

Abstract: More than two years ago, Apple began its transition away from Intel processors to its own chips: Apple Silicon. The transition has been a sometimes bumpy ride, but after years of waiting, today I feel the ride is coming to an end. In this article, I reflect on the journey behind us.

Roughly three years ago, Apple announced that it would slowly transition away from Intel processors to its own processors, called “Apple Silicon”. These processors use an entirely different architecture from Intel’s (namely, the RISC-based ARM architecture), and so it was clear from the outset that this transition would take time.

On the one hand, Apple needed time to iron out the quirks of this entirely new processor type. There was a lot to learn, and many of the issues only became apparent once the processors were out in the wild.

On the other hand, many developers had to work on their software as well. Especially where raw computing power is crucial – in CAD design, video editing, or machine learning – developers had to change their programs so that they could make full use of the new architecture.

I belonged to the first wave of users who switched to the new M1 chips back in 2020. As a daily user of large language models (LLMs), what I observed from the beginning was the Python community’s own transition. Working with LLMs requires tons of computing power, so every ounce of performance matters. In the beginning, working with LLMs on the M1 chips was painful, and it stayed that way for several years, until today.

Today, I feel that the transition is finally ending, because PyTorch now supports Apple Silicon well enough that inference, even with very large models, is blazingly fast. Thus, I decided to write up some notes reflecting on the journey behind us.

In the Beginning there were Errors

As soon as I bought the M1 MacBook Pro back in 2020, I needed to set up a data analysis pipeline. I had just started my PhD and knew that I was going to need the Python ecosystem, since I was starting to home in on text analysis.

Naturally, because Apple Silicon was still so new, library support for the architecture was abysmal. It was so bad, in fact, that at first I simply couldn’t use most data science libraries, because they required the old Intel architecture.

It took me weeks to figure out how to set everything up properly, but after lots of headaches I finally managed to do so. This led to my first post on the Apple ARM transition back in January 2021. That post is to this day my most-read article; dozens of people have contacted me on Twitter and by e-mail to ask for specific details on how to set everything up.

The start of the transition was indeed anything but a joyful ride.


Fast-forward half a year. In July 2021 I received more and more requests for an updated article, since many were wondering: “Is it still this complicated to set everything up?” Thus, I decided to write a follow-up piece, which you can find here.

As you can see, only half a year into the ARM transition of the Apple ecosystem, many packages had caught up. The package managers finally understood the combination of macOS and the ARM architecture, and most packages could be installed almost as easily as back in the Intel days.

However, one problem remained that was specific to neural networks: a lack of support for Apple’s new GPUs. You may have heard that running neural networks on a CPU is a bad idea. Most scientists who run neural networks therefore try to get their hands on an Nvidia GPU. With one of those, running neural networks is comparatively fast and much less of a hassle than running them on CPUs.

Apple’s ARM chips do not have an Nvidia GPU, however. And so neural networks were restricted to the CPU, which made it impossible to work productively with them on a Mac. So let’s dig into what this GPU that Apple has created actually is.

Metal Performance Shaders

When Apple announced its new toy, it made sure to emphasize that the chip had a built-in “neural engine”. That is a fancy term for a dedicated accelerator designed to run neural networks much faster than the CPU can (the PyTorch support discussed below actually targets the chip’s GPU cores, not the Neural Engine).

However, in order to use this new, fancy hardware, the necessary code had to be implemented first. Apple provides an API for that: Metal, together with its “Metal Performance Shaders” (MPS) library, which lets developers run compute kernels on the GPU cores instead of the CPU. Implementing support for these was apparently a huge pain.

The main issue was that the developers of PyTorch (the go-to framework for working with neural networks) had to implement every single operation specifically for this Metal backend, and that took time. Initial MPS support landed in PyTorch 1.12 in mid-2022, and by the end of that year they released PyTorch 1.13 with improved support for the new backend.
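For reference, checking whether a given PyTorch installation can use this backend can be sketched roughly like this (a minimal example, assuming PyTorch 1.12 or newer):

```python
import torch

# is_built() reports whether this PyTorch build includes the MPS backend
# at all; is_available() reports whether the current machine actually
# exposes a Metal-capable GPU. Fall back to the CPU otherwise.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"MPS built: {torch.backends.mps.is_built()}, using: {device}")
```

On an Apple Silicon Mac with a recent PyTorch build this selects the `mps` device; everywhere else it quietly falls back to the CPU.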

However, the support was not enough for running my models, and I frequently encountered errors and other issues. So it was not yet time to use this new backend. At the same time, I got access to a large data center in Sweden where I could use proper GPUs, so the issue became less pressing.
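One mitigation that helped during this period (and still does for the remaining gaps) is PyTorch’s documented CPU-fallback switch, which can be sketched like this:

```python
import os

# This must be set before torch is imported: operations that lack an
# MPS kernel are then routed to the CPU instead of raising an error.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402  (imported after the env var on purpose)
```

This trades speed for robustness: the unsupported operations run slowly on the CPU, but the model runs end to end.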

All the while I monitored the support for MPS on GitHub, and it did seem to be progressing well.

Then, two days ago, I decided to give it another try. I discovered that PyTorch had released version 2.0, which ships with broader MPS support, so chances were that it might actually work. I didn’t want to bet on it, but I mean: hope dies last, right?

PyTorch and NumPy: Optimized for Apple ARM

And, lo and behold: it worked out of the box. In fact, the first time I ran a model on the MPS backend with full GPU acceleration I couldn’t believe my eyes: what had previously taken about five hours now took only three minutes. That is almost as fast as the GPUs I have access to in the data center.1
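The setup itself is unspectacular: once the device is selected, everything else is the usual PyTorch workflow. A minimal sketch, using a tiny stand-in network instead of an actual language model:

```python
import torch
import torch.nn as nn

# Pick the MPS device when available, otherwise fall back to the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# A tiny stand-in model; a real LLM would be moved over the same way,
# i.e. model.to(device) after loading its weights.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
model.eval()

with torch.no_grad():
    batch = torch.randn(8, 16, device=device)  # dummy input batch
    logits = model(batch)

print(logits.shape)  # torch.Size([8, 4])
```

The point is that no model code needs to change: the same script runs on CUDA, MPS, or the CPU depending only on which device string is chosen.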

I don’t really know how to properly convey the impact of this: running even large language models on a consumer laptop is now a possibility, and it shows how performant these new computers are. Think about it: when Google introduced the transformer architecture in 2017, it took a data center to run these models. Now, only six years later, we can literally run the same (and larger) models on a consumer laptop.

This is huge. Over the past two days I ran into some problems with my models and had to tinker with them. But instead of pushing my code to some data center, I could train model after model on my own computer, making the process much faster and more productive.

Don’t get me wrong: There is still work to do, and we’re not 100% there yet. But for many day-to-day tasks, I think it’s safe to say: PyTorch is optimized for Apple’s new chip. And this is just awesome. A long journey finally comes to an end.

Going Forward

Any new technology is always going to face difficulties. People need to get used to it and figure out ways of using it productively and creatively. Apple’s new processors are no exception. But the main credit goes to the community, which went from “We have no idea what this Metal thingy is” to implementing enough support that we can now use the capabilities of this new chip to their full extent.

It is somewhat ironic that for the past two years I was daily driving an M1 MBP and never once used its GPU cores. It felt like using a Ferrari to commute through a congested downtown. Now that I have the newer M2 Pro chip, I can finally use them. But in any case, the important part is: we’re there, after a little more than two years.

Back in 2020, the first step the Python community had to take was to make all libraries available for the Apple M1 chip, which happened around January 2021.

The second step was to make it easy to get all of these packages. That was reached in mid-2021 when I wrote my second article.

But the third step was to optimize the software so that it didn’t just run, but ran fast. And this was reached in early 2023.

P. S. To all other computer manufacturers: can you please follow suit and migrate from the Intel x86 architecture over to ARM? That way we get longer battery life and, at the same time, more powerful machines when we need them. That’d be great. Thanks!

  1. Of course, this is a somewhat stupid comparison. I currently own an M2 Pro with 19 GPU cores and 16 GB of shared memory. In the data center, I can use Nvidia A100 GPUs with 48 GB of dedicated VRAM. So there is a huge gap between the two. But nevertheless, at least for my own day-to-day use, I feel that the chip in my MacBook is fast enough.

Suggested Citation

Erz, Hendrik (2023). “Large Language Model Inference with PyTorch on Apple Silicon”, 28 Apr 2023.

