I’ve recently been spending some time getting large language models to work on my no-longer-excellent TITAN V GPU. It’s essentially the consumer-grade version of the V100, with less memory, running in an Ubuntu box alongside a reasonably powerful 20-core Intel Core i9 CPU from a few years ago.
The folks at PyTorch recently released a package called GPT-Fast, enabling accelerated inference of LLMs on a single GPU. Notably, it relies (almost) entirely on out-of-the-box PyTorch optimizations, and incorporates a few widely cited techniques like integer quantization and speculative sampling.
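To give a sense of what “out of the box” means here: the heavy lifting comes from features that ship with stock PyTorch, like `torch.compile` and inference mode. The sketch below is my own minimal illustration of that pattern, not code from the gpt-fast repo; the toy model and function name are placeholders.

```python
import torch

# Placeholder model: in gpt-fast this would be a Llama-style transformer
# with a static KV cache; here it's just any torch.nn.Module.
model = torch.nn.Linear(4096, 4096).cuda().eval()

# torch.compile is the core "out of the box" optimization: it lowers the
# decoding step to fused kernels without any custom C++/CUDA extensions.
compiled_model = torch.compile(model, mode="reduce-overhead")

@torch.inference_mode()
def decode_one_token(x: torch.Tensor) -> torch.Tensor:
    # One forward pass of the compiled model per generated token.
    return compiled_model(x)

out = decode_one_token(torch.randn(1, 4096, device="cuda"))
```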
However, the GPUs they benchmark on are largely the relatively new A100-class GPUs, which I don’t own in a personal capacity and which remain fairly expensive for both consumers and enterprises to access widely. For example, on AWS a p4d.24xlarge, the smallest available instance with A100 GPUs, costs $32/hour for 8 GPUs, or roughly $4/hour/GPU if you can manage to use them all. A p3.2xlarge instance (with a single V100) costs roughly $3/hour/GPU, is easier to get than an A100 on the spot market, and goes for around half that price there.
So, I figured I would see how much performance degrades when running LLM inference on a few-years-old, consumer-grade, Volta-class GPU. That gives some measure of how much of the performance we see depends on top-of-the-line hardware versus what consumers and companies could more reasonably be expected to have on hand, and it may be a better benchmark of what future on-device inference speeds could look like. The caveat is that my Volta GPU, being consumer-grade, has a tiny amount of memory by today’s standards: only about 12 GB to work with.
Also, one of the big optimizations in the newer NVIDIA architectures (Ampere and beyond) is support for the bfloat16 type, which improves efficiency and is widely used in LLMs. Unfortunately, Volta-class NVIDIA GPUs don’t support this type, so I had to make some changes to the repo (under review from its maintainers) to add better support for casting all of the relevant parameters to a type my device supports. Feel free to give the code a try and see if it works on your GPU of choice.
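The change is conceptually simple: pick a dtype the device actually supports and cast the checkpoint’s parameters to it before running. Here is a minimal sketch of the idea (the helper name and placeholder module are mine, not the repo’s):

```python
import torch

def pick_inference_dtype() -> torch.dtype:
    # Ampere (sm_80) and newer support bfloat16 natively; Volta (sm_70)
    # does not, so fall back to float16 there.
    return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

dtype = pick_inference_dtype()

# Placeholder module standing in for the loaded Llama checkpoint.
model = torch.nn.Linear(4096, 4096)

# Cast all parameters and buffers to the device-supported dtype.
model = model.to(device="cuda", dtype=dtype)
```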
And with that working, here is the benchmark from a few random prompts I put in:
| Model | GPU | Technique | Tokens/Second | Memory Bandwidth (GB/s) |
|---|---|---|---|---|
| Llama-2-7B | NVIDIA Titan V | Base | OOM | OOM |
| Llama-2-7B | NVIDIA Titan V | 8-bit | 78.58 | 540.05 |
| open_llama_7b | NVIDIA Titan V | Base | OOM | OOM |
| open_llama_7b | NVIDIA Titan V | 8-bit | 76.44 | 525.31 |
| Llama-2-7B | NVIDIA A100 (reported) | Base | 104.9 | 1397.31 |
| Llama-2-7B | NVIDIA A100 (reported) | 8-bit | 155.58 | 1069.20 |
Unfortunately, at the moment, the Llama-2-13B model is still too large to fit into memory on this older generation of GPU even with 8-bit integer quantization, so we only have the 7B models to work with. In short, we see roughly a 50% slowdown, in both tokens/second and memory bandwidth, compared to the A100 benchmarks posted in the repo. Depending on the pricing structure for accessing accelerated compute, this tradeoff may or may not be worth it for a user running inference at scale: if you’re batching a large number of queries in parallel, the higher-memory A100 chips let large batches run very efficiently. On the other hand, if you’re running on edge hardware, roughly 78 tokens/second is certainly more than fast enough for real-time inference with a single user running queries. You can readily acquire Titan Vs for about $500 on eBay, and even the higher-memory V100s for under $1,000.
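As a rough sanity check on those numbers: single-stream decoding is memory-bound, so tokens/second should be close to the achieved memory bandwidth divided by the bytes of weights read per token. At int8, a 7B-parameter model is about 7 GB of weights, so a quick back-of-the-envelope calculation (using the Titan V result from the table) lines up with what was measured:

```python
# Rough estimate: tokens/sec ~= achieved bandwidth / bytes of weights read per token.
params = 7e9                     # ~7B parameters
bytes_per_param = 1              # int8 quantization
model_bytes = params * bytes_per_param   # ~7 GB of weights read per generated token

measured_bandwidth = 540.05e9    # Titan V int8 result from the table, in bytes/s

print(measured_bandwidth / model_bytes)  # ~77 tokens/sec, close to the 78.58 measured
```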
As a final aside, I wasn’t able to get int4 quantization to work on my machine, and I’m fairly sure it’s because the Volta architecture predates int4 tensor core support, but if anyone knows more about this, do let me know. As always, do reach out if you have any thoughts or questions.
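If you want to check what your own card supports before trying any of this, PyTorch exposes the relevant queries directly; a quick diagnostic looks something like this:

```python
import torch

# Compute capability: (7, 0) is Volta, (8, 0) is Ampere, and so on.
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: sm_{major}{minor}")

# bfloat16 needs Ampere (sm_80) or newer.
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
```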