Needle: The 26M-Parameter Model That Brings Gemini-Grade Tool Calling to Your Phone

Bitautor

What if you could run a model that understands tool calling — the ability to translate "What's the weather in San Francisco?" into a structured API call — on a device smaller than your phone?

That's exactly what the team at Cactus Compute has achieved with Needle, a 26-million-parameter "Simple Attention Network" that distills Gemini 3.1's tool-calling abilities into a footprint so small it runs on laptops, phones, and even smart glasses.
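To make that concrete, here is roughly what the translation looks like. This is an illustrative sketch only: the tool schema and expected output below follow generic function-calling conventions, not Needle's documented interface.

    # Illustrative only: a generic tool definition and the structured call a
    # tool-calling model is expected to produce. Needle's exact format may differ.
    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }]

    query = "What's the weather in San Francisco?"

    # The model's entire job: pick the right tool and fill in its arguments.
    expected_call = {
        "name": "get_weather",
        "arguments": {"city": "San Francisco"},
    }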


The Big Story: Tool Calling at 14 MB

Needle isn't just another small model. It represents a fundamentally different approach to making AI useful on consumer hardware. The team distilled Gemini 3.1's function-calling ability into a specialized architecture — a Simple Attention Network — that achieves:


  • 6,000 tokens/sec prefill speed in production on Cactus infrastructure
  • 1,200 tokens/sec decode — fast enough for real-time interactions
  • INT4 quantized weights at just 14 MB — smaller than most images on the web
  • Full trainability on a Mac or PC — you can finetune it on your own tools

The model was pretrained on 16 TPU v6e chips for 200 billion tokens (27 hours), then post-trained on 2 billion tokens of single-shot function-call data (45 minutes). The weights are fully open on Hugging Face, along with the dataset generation pipeline.
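Those figures are easy to sanity-check with back-of-envelope arithmetic. The snippet below derives estimates from the specs quoted above; the 40-token call used for the latency estimate is an assumed example, not a reported number.

    # Back-of-envelope checks derived from the reported specs.
    params = 26e6

    # INT4 weights store 4 bits (0.5 bytes) per parameter.
    weight_mb = params * 0.5 / 1e6            # ~13 MB, in line with the ~14 MB
                                              # file once format overhead is added

    # Pretraining: 200B tokens in 27 hours on 16 TPU v6e chips.
    aggregate_tok_s = 200e9 / (27 * 3600)     # ~2.06M tokens/sec across the pod
    per_chip_tok_s = aggregate_tok_s / 16     # ~129K tokens/sec per chip

    # Decode: at 1,200 tokens/sec, an assumed 40-token call takes ~33 ms.
    call_latency_ms = 40 / 1200 * 1000

    print(f"{weight_mb:.1f} MB | {per_chip_tok_s:,.0f} tok/s/chip | {call_latency_ms:.0f} ms")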

What makes Needle particularly compelling is its focused design. Rather than being a general-purpose chatbot, it's purpose-built for one thing: taking a natural language query plus a set of tools, and outputting the correct structured function call. In benchmarks, it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM 2.5-350M on single-shot function calling for personal AI use cases.

Why This Matters: On-Device AI Is Getting Real

The implications for edge AI are significant. For months, the narrative around "AI on device" has been dominated by large models getting smaller — but most remain in the 1B–8B parameter range, still too heavy for truly portable scenarios.
Needle shows there's another path: instead of shrinking a generalist model, build a specialist from scratch. A 26M model that nails one critical capability — tool calling — could be the key to unlocking genuinely useful AI assistants that run entirely on your phone, watch, or AR glasses, with no cloud dependency or latency.

As one Hacker News commenter noted, this opens the door to "building something like a command line program where you can optionally just specify the arguments in natural language." Think of a command like "toolcli add tom to teamfutz group" being parsed directly on-device, with no internet connection.
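Here is a rough sketch of what that could look like. Everything in it is hypothetical: parse_with_local_model is a stand-in for whatever on-device runtime hosts the model, and no real Needle or Cactus API is implied.

    # Hypothetical CLI whose arguments can optionally be given in natural language.
    # parse_with_local_model() is a placeholder for an on-device tool-calling
    # model such as Needle; wire up your own runtime there.
    import sys

    TOOLS = [{
        "name": "add_user_to_group",
        "parameters": {"user": "string", "group": "string"},
    }]

    def parse_with_local_model(text: str, tools: list) -> dict:
        """Placeholder: run the local model and return a structured call."""
        raise NotImplementedError("connect an on-device model here")

    def main() -> None:
        text = " ".join(sys.argv[1:])  # e.g. "add tom to teamfutz group"
        call = parse_with_local_model(text, TOOLS)
        print(call)  # e.g. {"name": "add_user_to_group",
                     #       "arguments": {"user": "tom", "group": "teamfutz"}}

    if __name__ == "__main__":
        main()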

Claude Goes to Law School

In a parallel story from the enterprise AI front, Anthropic announced this week that Claude can now connect to a range of legal industry tools, including DocuSign, Box, Thomson Reuters, and Harvey. Lawyers can use Claude to review contracts, surface case law, and draft documents directly within their existing workflow tools.

This is a significant expansion for Anthropic's enterprise strategy. The legal industry has been one of the most cautious adopters of generative AI due to confidentiality concerns and accuracy requirements. By integrating directly with tools lawyers already use, rather than asking them to adopt a new interface, Claude positions itself as an assistant that augments rather than disrupts.

OpenAI's Safety Committee Confirms Model Delays

The Verge's live coverage of the Musk v. Altman trial also brought a notable revelation: Dr. Jeremy "Zico" Kolter, chair of OpenAI's safety and security committee, confirmed that the committee has formally requested delays of model releases on two occasions so far.

Kolter detailed OpenAI's layered safety structure: the Safety Systems team (guardrails and evaluations), the Preparedness team (which runs the OpenAI Preparedness Framework), the Alignment team (alignment with human values), the Model Policy team (which maintains the model spec), and investigative teams. In total, he said, roughly 200 people work on safety across these groups.

The disclosure of formal model delays adds weight to the ongoing debate about whether AI companies have sufficient internal checks and balances — and whether those checks survive competitive pressure.

Your Takeaway This Week

Three stories, one theme: AI is getting both more capable and more specialized.

  1. Needle proves that distillation isn't just about bigger models getting smaller — it's about creating entirely new model families optimized for specific, high-value tasks.
  2. Claude for legal shows that the next growth frontier isn't consumer chatbots but deeply integrated enterprise assistants.
  3. OpenAI's safety delays remind us that even as models improve, the governance infrastructure around them is still being built in real time.

