AI is moving to the edge.
There are tremendous advantages to running AI algorithms directly on devices at the edge—phones, smart speakers, cameras, vehicles—without shuttling data back and forth to the cloud.
Perhaps most importantly, edge AI enhances data privacy because data need not move from its source to a remote server. Edge AI also offers lower latency, since all processing happens locally; this makes a critical difference for time-sensitive applications like autonomous vehicles or voice assistants. It is more energy- and cost-efficient, an increasingly important consideration as the computational and economic costs of machine learning balloon. And it enables AI algorithms to run autonomously, with no Internet connection required.
Nvidia CEO Jensen Huang, one of the titans of the AI business world, sees edge AI as the future of computing: “AI is moving from the cloud to the edge, where smart sensors connected to AI computers can speed checkouts, direct forklifts, orchestrate traffic, save power. In time, there will be trillions of these small autonomous computers, powered by AI.”
But in order for this lofty vision of ubiquitous intelligence at the edge to become a reality, a key technology breakthrough is required: AI models need to get smaller. A lot smaller. Developing and commercializing techniques to shrink neural networks without compromising their performance has thus become one of the most important pursuits in the field of AI.
The typical deep learning model today is massive, requiring significant computational and storage resources to run. OpenAI’s new language model GPT-3, which made headlines this summer, has a whopping 175 billion parameters; at 16-bit precision, that works out to more than 350 GB just to store the model. Even models that don’t approach GPT-3 in size are still extremely computationally intensive: ResNet-50, a widely used computer vision model developed a few years ago, requires roughly 3.8 billion floating-point operations to process a single image.
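A quick back-of-the-envelope calculation shows where that 350 GB figure comes from (a rough sketch assuming the weights are stored as 16-bit floats; at 32-bit precision the footprint doubles):

```python
params = 175e9              # GPT-3 parameter count
bytes_per_param = 2         # assumption: 16-bit (half-precision) floats
storage_gb = params * bytes_per_param / 1e9
print(f"{storage_gb:.0f} GB")  # -> 350 GB, before counting activations or optimizer state
```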
These models cannot run at the edge. The hardware processors in edge devices (think of the chips in your phone, your Fitbit, or your Roomba) are simply not powerful enough to support them.
Developing methods to make deep learning models more lightweight therefore represents a critical unlock: it will unleash a wave of product and business opportunities built around decentralized artificial intelligence.
How would such model compression work?
Researchers and entrepreneurs have made tremendous strides in this field in recent years, developing a series of techniques to miniaturize neural networks. These techniques can be grouped into five major categories: pruning, quantization, low-rank factorization, compact convolutional filters, and knowledge distillation.
Pruning entails identifying and eliminating the redundant or unimportant connections in a neural network in order to slim it down. Quantization compresses models by using fewer bits to represent weights and activations. In low-rank factorization, a model’s weight matrices are decomposed into products of smaller matrices that approximate the originals with far fewer parameters. Compact convolutional filters are specially designed filters that reduce the number of parameters required to carry out convolution. Finally, knowledge distillation involves using the full-sized version of a model to “teach” a smaller model to mimic its outputs.
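To make the first two techniques concrete, here is a minimal sketch using PyTorch’s built-in pruning and dynamic-quantization utilities on a toy model (the model and the 50% sparsity level are illustrative choices, not a recommendation):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy stand-in for a real network.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# Pruning: zero out the 50% of weights with the smallest magnitude in each layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Quantization: store the remaining weights as 8-bit integers instead of
# 32-bit floats, shrinking these layers roughly 4x on disk.
small_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```

Note that pruning on its own produces zeros rather than a physically smaller file; realizing the size and speed savings typically requires sparse storage formats or hardware that can exploit the zeros.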
These techniques are mostly independent of one another, meaning they can be deployed in tandem for improved results. Some of them (pruning, quantization) can be applied after the fact to models that already exist, while others (compact filters, knowledge distillation) require developing models from scratch.
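As a sketch of the from-scratch end of the spectrum, knowledge distillation is usually implemented as a training loss that pushes a small “student” network to match the softened output distribution of a large, frozen “teacher.” The temperature T, the weighting alpha, and the teacher/student/dataloader objects below are illustrative assumptions rather than fixed prescriptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn from the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical training loop; `teacher`, `student`, `dataloader`, and `optimizer`
# are assumed to be defined elsewhere:
# teacher.eval()
# for inputs, labels in dataloader:
#     with torch.no_grad():
#         teacher_logits = teacher(inputs)
#     loss = distillation_loss(student(inputs), teacher_logits, labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```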
A handful of startups have emerged to bring neural network compression technology from research to market. Among the more promising are Pilot AI, Latent AI, Edge Impulse and Deeplite. As one example, Deeplite claims that its technology can make neural networks 100x smaller, 10x faster, and 20x more power-efficient without sacrificing performance.
“The number of devices in the world that have some computational capability has skyrocketed in the last decade,” explained Pilot AI CEO Jon Su. “Pilot AI’s core IP enables a significant reduction in the size of the AI models used for tasks like object detection and tracking, making it possible for AI/ML workloads to be run directly on edge IoT devices. This will enable device manufacturers to transform the billions of sensors sold every year—things like push button doorbells, thermostats, or garage door openers—into rich tools that will power the next generation of IoT applications.”
Large technology companies are actively acquiring startups in this category, underscoring the technology’s long-term strategic importance. Earlier this year Apple acquired Seattle-based Xnor.ai for a reported $200 million; Xnor’s technology will help Apple deploy edge AI capabilities on its iPhones and other devices. In 2019 Tesla snapped up DeepScale, one of the early pioneers in this field, to support inference on its vehicles.
And one of the most important technology deals in years—Nvidia’s pending $40 billion acquisition of Arm, announced last month—was motivated in large part by the accelerating shift to efficient computing as AI moves to the edge.
Emphasizing this point, Nvidia CEO Jensen Huang said of the deal: “Energy efficiency is the single most important thing when it comes to computing going forward…. Together, Nvidia and Arm are going to create the world’s premier computing company for the age of AI.”
In the years ahead, artificial intelligence will become untethered, decentralized and ambient, operating on trillions of devices at the edge. Model compression is an essential enabling technology that will help make this vision a reality.