Google’s Latest App Lets Your Phone Run AI in Your Pocket—Entirely Offline


Google has released a new app that nobody asked for, but everyone wants to try.

The AI Edge Gallery, which launched quietly on May 31, puts artificial intelligence directly on your smartphone—no cloud, no internet, and no sharing your data with Big Tech’s servers.

The experimental app—released under the Apache 2.0 license, allowing anyone to use it for almost anything—is available on GitHub, starting with the Android platform. The iOS version is coming soon.

It runs models like Google’s Gemma 3n entirely offline, processing everything from image analysis to code writing using nothing but your phone’s hardware.

And it’s surprisingly good.

The app, which appears to be aimed at developers for now, includes three main features: AI Chat for conversations, Ask Image for visual analysis, and Prompt Lab for single-turn tasks such as rewriting text.

Users can download models from platforms like Hugging Face, although the selection remains limited to models such as Gemma-3n-E2B and Qwen2.5-1.5B.


Reddit users immediately questioned the app’s novelty, comparing it to existing solutions like PocketPal.

Some raised security concerns, though the app’s hosting on Google’s official GitHub counters impersonation claims. No evidence of malware has surfaced yet.

We tested the app on a Samsung Galaxy S24 Ultra, downloading both the largest and smallest Gemma 3 models available.

Each AI model is a self-contained file that holds all its “knowledge”—think of it as downloading a compressed snapshot of everything the model learned during training, rather than a giant database of facts like a local Wikipedia app. The largest Gemma 3 model available in-app is approximately 4.4 GB, while the smallest is around 554 MB.

Once downloaded, no further data is required—the model runs entirely on your device, answering questions and performing tasks using only what it learned before release.

Even on low-speed CPU inference, the experience matched what GPT-3.5 delivered at launch: not blazing fast with the bigger models, but definitely usable.

The smaller Gemma 3 1B model achieved speeds exceeding 20 tokens per second, providing a smooth experience with reliable accuracy under supervision.

This matters when you’re offline or handling sensitive data you’d rather not share with Google or OpenAI’s training algorithms, which use your data by default unless you opt out.



GPU inference on the smallest Gemma model delivered impressive prefill speeds over 105 tokens per second, while CPU inference managed 39 tokens per second. Token output—how fast the model generates responses after thinking—reached around 10 tokens per second on GPU on average and seven on CPU.
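To put those figures in perspective, here is a rough back-of-the-envelope sketch of how prefill and output speeds combine into a total wait time. The speeds are the ones measured above; the prompt and reply lengths (200 and 150 tokens) and the function name are illustrative assumptions, not measurements.

```python
def estimated_response_seconds(prompt_tokens, reply_tokens, prefill_tps, decode_tps):
    """Rough wait time: read the prompt at prefill speed, then write the reply at output speed."""
    prefill_time = prompt_tokens / prefill_tps
    decode_time = reply_tokens / decode_tps
    return prefill_time + decode_time

# Speeds from the test above (smallest Gemma model), in tokens per second.
GPU = {"prefill_tps": 105, "decode_tps": 10}
CPU = {"prefill_tps": 39, "decode_tps": 7}

# Hypothetical workload: a 200-token prompt and a 150-token reply.
gpu_seconds = estimated_response_seconds(200, 150, **GPU)
cpu_seconds = estimated_response_seconds(200, 150, **CPU)

print(f"GPU: ~{gpu_seconds:.1f} s, CPU: ~{cpu_seconds:.1f} s")
```

Under these assumptions, the GPU would take roughly 17 seconds and the CPU roughly 27 seconds, with almost all of that time spent generating the reply rather than reading the prompt.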

The multimodal capabilities worked well in testing.

CPU inference on smaller models also appeared to yield better results than GPU inference. The evidence is anecdotal, but the pattern showed up across several tests.

For example, during a vision task, the model on CPU inference accurately guessed my age and my wife’s in a test photo: late 30s for me, late 20s for her.

The supposedly better GPU inference got my age wrong, guessing I was in my 20s. (I’ll take this “information” over the truth any day, though.)


Google’s models come with heavy censorship, but basic jailbreaks can be achieved with minimal effort.

Unlike centralized services that ban users for circumvention attempts, local models don’t report back about your prompts. That lets you try jailbreak techniques, or ask for information that censored versions will not provide, without risking your subscription.


Third-party model support is available, but it is somewhat limited.

The app only accepts .task files, not the widely adopted .safetensors format that competitors like Ollama support.

This significantly limits the available models. There are methods to convert .safetensors files into .task, but the process is not for everybody.

Code handling works adequately, although specialized models like Codestral would handle programming tasks more effectively than Gemma 3. Again, such a model would need to be available as a .task file, but it could be a very effective alternative.

For basic tasks, such as rephrasing, summarizing, and explaining concepts, the models excel without sending data to Samsung or Google’s servers.

So, there is no need for users to grant big tech access to their input, keyboard, or clipboard, as their own hardware is handling all the necessary work.


The context window of 4096 tokens feels limited by 2025 standards, but matches what was the norm just two years ago.

Conversations flow naturally within those constraints, and that may be the best way to sum up the experience.
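For developers wondering what a 4096-token window means in practice, here is a minimal sketch of one common way to keep a chat inside a fixed budget: drop the oldest messages first. This is not how the AI Edge Gallery manages its context (the app's internals were not examined); the four-characters-per-token estimate, the 512-token reply reserve, and the function names are all assumptions for illustration.

```python
CONTEXT_WINDOW = 4096  # token limit reported for the app

def estimate_tokens(text):
    """Very rough heuristic (an assumption): about four characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages, reserve_for_reply=512, window=CONTEXT_WINDOW):
    """Keep the most recent messages that fit in the window, leaving room for the reply."""
    budget = window - reserve_for_reply
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

# Hypothetical conversation: 10 long messages of about 2,000 characters each.
history = [f"message {i}: " + "x" * 2000 for i in range(10)]
recent = trim_history(history)
print(len(recent), "of", len(history), "messages fit")
```

With these made-up message sizes, only the seven most recent messages fit, which shows how quickly a long exchange can outgrow the window.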

Considering that you are running an AI model on a smartphone, the app delivers an experience similar to early ChatGPT in terms of speed and text accuracy, with some advantages such as multimodality and code handling.

But why would you want to run a slower, inferior version of your favorite AI on your phone, taking up a lot of storage and making things more complicated than simply typing ChatGPT.com?

Privacy remains the killer feature. For example, healthcare workers handling patient data, journalists in the field, or anyone dealing with confidential information can now access AI capabilities without data leaving their device.

“No internet required” means the technology works in remote areas or while traveling, with all responses generated solely from what the model knew at the time it was trained.

Cost savings add up quickly. Cloud AI services charge per use, while local models only require your phone’s processing power. Small businesses and hobbyists can experiment without ongoing expenses. If you run a model locally, you can interact with it as much as you want without consuming quotas, credits, or subscriptions, and without incurring any payment.

Latency improvements feel noticeable. No server round-trip means faster responses for real-time applications, such as chatbots or image analysis. It also means your chatbot won’t ever go down.

Overall, for basic tasks, this could be more than enough for any user, with the free versions of ChatGPT, Claude, Gemini, Meta, Reka, and Mistral providing a good backup when heavier computation is needed.

Of course, this won’t be a substitute for your favorite internet-connected chatbot anytime soon. There are some early adoption challenges.

Battery drain concerns persist, especially with larger models; setup complexity might deter non-technical users; the model variety pales in comparison to cloud offerings; and Google’s decision not to support .safetensors models (the format in which the vast majority of openly available LLMs are published) is disappointing.

However, Google’s experimental release signals a shift in the philosophy of AI deployment. Instead of forcing users to choose between powerful AI and privacy, the company is offering both, even if the experience isn’t quite there yet.

The AI Edge Gallery delivers a surprisingly polished experience for an alpha release. Google’s optimization work has produced what is probably the best UI available for running AI models locally.

Adding .safetensors support would unlock the vast ecosystem of existing models, transforming a good app into an essential tool for privacy-conscious AI users.

Edited by Josh Quittner and Sebastian Sinclair
