AI is undoubtedly changing our world, integrating into ever more aspects of our lives. From virtual assistants to fitness trackers and smart home devices, it is making daily life more convenient, efficient, and connected.

However, if we want AI to be accessible and to benefit everyone, it needs to be optimized so it can run efficiently on a wide range of devices. This is where quantization comes in.

What is quantization?

If we choose a technical definition, quantization is:

The process of mapping input values from a large set to output values in a smaller set — Wikipedia

In non-tech terms, quantization simplifies large, precise values so the data can be stored and processed faster, without losing too much information.

Without knowing it, we use quantization principles in our daily lives. When a friend asks for the price of that $4.99 item and you reply "it's 5 bucks", you're rounding: reducing a "complex" number to convey the information faster. The level of detail is still enough for your friend to understand.

We also truncate time. Who says "Dude, I'm hungry, it's already 12:01:32!"?
You just go with "It's 12". It's faster to say (and to write, if you were to), but you're trading a little precision for speed.

Video compression uses the same idea. An HD video has much more detail, but it's slower to download because of the extra information it carries. If you don't need that level of quality, the SD version will give you mostly the same content… but faster.


Quantization in AI

In AI, the idea is to reduce the precision of the numbers used to represent the model’s parameters.

Let's say a model uses 16-bit floating-point numbers (each taking 2 bytes in memory). These numbers can be quantized to 4-bit integers, which use only 0.5 bytes each: a 4x reduction in memory.

This reduction in precision drastically decreases the model's size and speeds up its computations.
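To make this concrete, here is a minimal sketch of one common approach, symmetric (absmax) quantization to 4 bits. The function names and toy weights are mine, for illustration only, not from any particular library:

```python
import numpy as np

# NumPy has no 4-bit dtype, so we store the values in int8 here;
# in practice two 4-bit values are packed per byte, giving the 4x saving.

def quantize_4bit(weights):
    """Map float weights to signed 4-bit integers in [-8, 7]."""
    scale = float(np.abs(weights).max()) / 7  # 7 = largest positive 4-bit value
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

weights = np.array([0.82, -1.37, 0.05, 2.10], dtype=np.float16)
q, scale = quantize_4bit(weights)
print(q)                     # [ 3 -5  0  7]
print(dequantize(q, scale))  # roughly [0.9, -1.5, 0.0, 2.1] -- close, not exact
```

Notice that the dequantized values are only approximations of the originals: that small rounding error is exactly the precision we trade for the memory savings.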

Tech notes

Two main types of quantization are used:

  • Post-training quantization (PTQ): applied after the model has been trained. It converts the model's weights and activations from higher precision to lower precision (see the sketch after this list).
    It's like shooting a video in 4K but exporting an SD version.
  • Quantization-aware training (QAT): this method trains the model with quantization in mind. The precision is lowered during training, which usually gives better results than post-training quantization.
    It's like starting a painting while limiting yourself to a reduced palette of colors.
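For the first approach, here is a hedged sketch using PyTorch's built-in dynamic quantization API. Note that it targets 8-bit integers rather than 4-bit, and the toy model below is just a stand-in for a real trained network:

```python
import torch
import torch.nn as nn

# A toy model standing in for a real trained network.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training quantization: the Linear layers' float32 weights
# are converted to 8-bit integers after training, no retraining needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10]) -- same interface, smaller weights
```

The quantized model is a drop-in replacement: it accepts the same inputs and produces outputs of the same shape, just with smaller integer weights under the hood.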

Why is it needed?

Quantization makes AI models smaller, faster, and more energy-efficient.

You could argue that, since models are mostly cloud-based, this isn't really a consumer's concern. However, we expect models to handle more and more complex tasks.

In 2020, GPT-3, a text-only model, already had 175 billion parameters. Released 10 days ago, GPT-4o leverages multimodality (the ability to handle more than text as input and output: images, audio, video, documents, etc.).
While the exact number has not been disclosed, experts tend to assume it has more than 1 trillion parameters!

To cope with growing demand and new capabilities, while offering a decent user experience, these models must be optimized.

Moreover, before GPT-4o, multimodal models were not free (and we don't yet know how OpenAI's offering will evolve). The only way to use such models without paying was to run one yourself, which required a powerful machine and strong technical skills.


AI for everyone, everywhere

Let's take Mistral AI, for instance. Its models are designed to be lightweight but performant.

Its Mistral 7B model (7B for 7 billion parameters) would weigh 14 GB at 16-bit float precision (7B × 2 bytes). By lowering the precision to 4-bit integers, the size is divided by 4: 3.5 GB (7B × 0.5 bytes).
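The back-of-the-envelope math is easy to verify. This is only a sketch: real quantized checkpoints add a little overhead for scales and metadata.

```python
params = 7_000_000_000  # Mistral 7B

def size_gb(bits_per_param):
    """Model size in gigabytes for a given precision."""
    return params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

print(f"16-bit floats: {size_gb(16):.1f} GB")  # 14.0 GB
print(f"4-bit ints:    {size_gb(4):.1f} GB")   # 3.5 GB
```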

Easier to fit on any smartphone, right?
The model needs less storage, fewer resources, and less energy (easier on the battery), which also makes it more environmentally sustainable.

Since it can run locally (no need to transfer large quantities of data to cloud servers), it also costs less.

Sure, it comes with a slight loss in accuracy, but for most consumer needs that loss is negligible.

Final words

Quantization makes AI models smaller and faster, in exchange for a slight loss in precision. New models can now run locally on almost any modern device (smartphones, autonomous vehicles, wearable devices…), making AI more accessible and affordable.

For cloud-based models, quantization allows them to cope with increasing demand and growing capabilities while enhancing the user experience with faster responses.

As research and development in the field continue, we will undoubtedly see even more innovation and widespread adoption, democratizing AI and bringing its benefits to everyone.
