Researchers are making AI models fit in your pocket and run locally

Large AI models are the brains behind many of our favorite tools, from chatbots to translation apps. But these models can be huge – usually too big to fit anywhere but on distant servers. So each time you send a request, the data has to go to these servers and back. It works, yes, but it’s not fast, cheap, or entirely private.

What if you could eliminate all that data being sent back and forth?

This is exactly what researchers from Stanford and Princeton looked into, and they believe they have a way to fix it. Their new approach, called CALDERA, compresses massive AI models without making them noticeably dumber, so they can run locally on smaller, resource-constrained devices. Think of it as turning a hardcover book into a travel-friendly paperback – it’s the same story, just easier to carry.

“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said coauthor Rajarshi Saha, a Stanford Engineering Ph.D. student. “So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs.”

And they’re not the only ones chasing this idea. Across the tech world, a quieter revolution is happening. Companies and researchers are racing to make AI smaller, faster, and more efficient. The stakes are high: whoever cracks the code could change how we interact with AI.

Small Is the New Big

CALDERA combines two tricks, both named in the paper’s title: a “low-rank” step that strips redundant information out of the model’s weight matrices, and a “low-precision” step that stores the remaining numbers using fewer bits. The result? A model that’s lighter and quicker but still gets the job done. Tests with Meta’s popular Llama 2 and Llama 3 models showed impressive results – the compressed versions performed almost as well as their bulky counterparts.
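To get a feel for the general idea, here is a minimal sketch in Python/NumPy of those two ingredients: approximating a weight matrix as a low-rank factorization plus a coarsely quantized residual. This is an illustration of the broad technique, not the authors’ actual algorithm, and the names (compress_weight, quantize, rank, bits) are hypothetical.

```python
# Illustrative sketch (not CALDERA itself): approximate a weight matrix W
# as W ≈ Q + L @ R, where L @ R is a low-rank part and Q is a residual
# stored at very low precision.
import numpy as np

def quantize(x, bits=4):
    """Uniform round-to-nearest quantization to 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)     # integer codes in [0, levels]
    return codes * scale + lo              # dequantized values for comparison

def compress_weight(W, rank=16, bits=4):
    """Split W into a low-rank part plus a low-precision residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] * S[:rank]             # top singular directions, scaled
    R = Vt[:rank, :]
    residual = W - L @ R                   # what the low-rank part misses
    Q = quantize(residual, bits=bits)      # store the rest with few bits
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
Q, L, R = compress_weight(W, rank=32, bits=4)
W_hat = Q + L @ R
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The payoff in a real system is storage: the low-rank factors are small, and the residual takes only a few bits per entry instead of 16 or 32, which is what lets a compressed model squeeze onto consumer hardware.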

“Any time you can reduce the computational complexity, storage and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn’t handle such compute- and memory-intensive tasks,” said study coauthor Andrea Goldsmith.

Microsoft is also playing in this space with Phi-3-mini, a compact version of its larger models designed to run on laptops and even some smartphones.

Why It Matters

Shrinking AI isn’t just about convenience; it’s about accessibility and trust. Smaller models can run directly on your phone or laptop, meaning no more data zipping off to servers. Your information stays with you. And if it stays with you, there’s less risk of privacy breaches.

Efficiency is another big win. Smaller models use less energy, allowing them to run on devices like smartphones. Today, most people wouldn’t dream of running a full-sized AI on their phone—it would drain the battery in minutes. But compressed models could make that possible, creating a world where AI tools are truly personal.

What’s Next?

The race to shrink AI is far from over. Challenges remain, especially around energy use and memory demands. And while CALDERA and others are making progress, it’s clear there’s no one-size-fits-all solution.

Still, the benefits are too big to ignore. Running compressed LLMs locally reduces the computational power and memory they require – cutting costs, making these tools more affordable, and lowering their carbon footprint. It will be interesting to see how researchers and companies move forward.


The researchers will present their paper, “Compressing Large Language Models using Low Rank and Low Precision Decomposition,” at the Conference on Neural Information Processing Systems (NeurIPS) next month.
