Needle: The 26M-Parameter Model That Brings Gemini-Grade Tool Calling to Your Phone

Bitautor

What if you could run a model that understands tool calling — the ability to translate "What's the weather in San Francisco?" into a structured API call — on a device smaller than your phone?

That's exactly what the team at Cactus Compute has achieved with Needle, a 26-million-parameter "Simple Attention Network" that distills Gemini 3.1's tool-calling abilities into a footprint so small it runs on laptops, phones, and even smart glasses.
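To make that concrete, here is roughly what the translation looks like. This is an illustrative sketch only: the tool schema and expected output below follow generic function-calling conventions, not Needle's documented interface.

    # Illustrative only: a generic tool definition and the structured call a
    # tool-calling model is expected to produce. Needle's exact format may differ.
    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }]

    query = "What's the weather in San Francisco?"

    # The model's entire job: pick the right tool and fill in its arguments.
    expected_call = {
        "name": "get_weather",
        "arguments": {"city": "San Francisco"},
    }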


The Big Story: Tool Calling at 14 MB

Needle isn't just another small model. It represents a fundamentally different approach to making AI useful on consumer hardware. The team distilled Gemini 3.1's function-calling ability into a specialized architecture — a Simple Attention Network — that achieves:


  • 6,000 tokens/sec prefill speed in production on Cactus infrastructure
  • 1,200 tokens/sec decode — fast enough for real-time interactions
  • INT4 quantized weights at just 14 MB — smaller than most images on the web
  • Full trainability on a Mac or PC — you can finetune it on your own tools

The model was pretrained on 16 TPU v6e chips for 200 billion tokens (27 hours), then post-trained on 2 billion tokens of single-shot function-call data (45 minutes). The weights are fully open on Hugging Face, along with the dataset generation pipeline.
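Those figures are easy to sanity-check with back-of-envelope arithmetic. The snippet below derives estimates from the specs quoted above; the 40-token call used for the latency estimate is an assumed example, not a reported number.

    # Back-of-envelope checks derived from the reported specs.
    params = 26e6

    # INT4 weights store 4 bits (0.5 bytes) per parameter.
    weight_mb = params * 0.5 / 1e6            # ~13 MB, in line with the ~14 MB
                                              # file once format overhead is added

    # Pretraining: 200B tokens in 27 hours on 16 TPU v6e chips.
    aggregate_tok_s = 200e9 / (27 * 3600)     # ~2.06M tokens/sec across the pod
    per_chip_tok_s = aggregate_tok_s / 16     # ~129K tokens/sec per chip

    # Decode: at 1,200 tokens/sec, an assumed 40-token call takes ~33 ms.
    call_latency_ms = 40 / 1200 * 1000

    print(f"{weight_mb:.1f} MB | {per_chip_tok_s:,.0f} tok/s/chip | {call_latency_ms:.0f} ms")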

What makes Needle particularly compelling is its focused design. Rather than being a general-purpose chatbot, it's purpose-built for one thing: taking a natural language query plus a set of tools, and outputting the correct structured function call. In benchmarks, it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM 2.5-350M on single-shot function calling for personal AI use cases.

Why This Matters: On-Device AI Is Getting Real

The implications for edge AI are significant. For months, the narrative around "AI on device" has been dominated by large models getting smaller — but most remain in the 1B–8B parameter range, still too heavy for truly portable scenarios.
Needle shows there's another path: instead of shrinking a generalist model, build a specialist from scratch. A 26M model that nails one critical capability — tool calling — could be the key to unlocking genuinely useful AI assistants that run entirely on your phone, watch, or AR glasses, with no cloud dependency or latency.

As one Hacker News commenter noted, this opens the door to "building something like a command line program where you can optionally just specify the arguments in natural language." Think of a command like "toolcli add tom to teamfutz group" being parsed directly on-device, with no internet connection.
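Here is a rough sketch of what that could look like. Everything in it is hypothetical: parse_with_local_model is a stand-in for whatever on-device runtime hosts the model, and no real Needle or Cactus API is implied.

    # Hypothetical CLI whose arguments can optionally be given in natural language.
    # parse_with_local_model() is a placeholder for an on-device tool-calling
    # model such as Needle; wire up your own runtime there.
    import sys

    TOOLS = [{
        "name": "add_user_to_group",
        "parameters": {"user": "string", "group": "string"},
    }]

    def parse_with_local_model(text: str, tools: list) -> dict:
        """Placeholder: run the local model and return a structured call."""
        raise NotImplementedError("connect an on-device model here")

    def main() -> None:
        text = " ".join(sys.argv[1:])  # e.g. "add tom to teamfutz group"
        call = parse_with_local_model(text, TOOLS)
        print(call)  # e.g. {"name": "add_user_to_group",
                     #       "arguments": {"user": "tom", "group": "teamfutz"}}

    if __name__ == "__main__":
        main()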

Claude Goes to Law School

In a parallel story from the enterprise AI front, Anthropic announced this week that Claude can now connect to a range of legal industry tools, including DocuSign, Box, Thomson Reuters, and Harvey. Lawyers can use Claude to review contracts, surface case law, and draft documents directly within their existing workflow tools.

This is a significant expansion for Anthropic's enterprise strategy. The legal industry has been one of the most cautious adopters of generative AI due to confidentiality concerns and accuracy requirements. By integrating directly with tools lawyers already use, rather than asking them to adopt a new interface, Claude positions itself as an assistant that augments rather than disrupts.

OpenAI's Safety Committee Confirms Model Delays

The Verge's live coverage of the Musk v. Altman trial also brought a notable revelation: Dr. Jeremy "Zico" Kolter, chair of OpenAI's safety and security committee, confirmed that the committee has formally requested delays of model releases on two occasions so far.

Kolter detailed OpenAI's layered safety structure: the Safety Systems team (guardrails and evaluations), the Preparedness team (which runs the OpenAI Preparedness Framework), the Alignment team (alignment with human values), the Model Policy team (which maintains the model spec), and investigative teams. In total, he said, roughly 200 people work on safety across these groups.

The disclosure of formal model delays adds weight to the ongoing debate about whether AI companies have sufficient internal checks and balances — and whether those checks survive competitive pressure.

Your Takeaway This Week

Three stories, one theme: AI is getting both more capable and more specialized.

  1. Needle proves that distillation isn't just about bigger models getting smaller — it's about creating entirely new model families optimized for specific, high-value tasks.
  2. Claude for legal shows that the next growth frontier isn't consumer chatbots but deeply integrated enterprise assistants.
  3. OpenAI's safety delays remind us that even as models improve, the governance infrastructure around them is still being built in real time.

