Needle: The 26M-Parameter Model That Brings Gemini-Grade Tool Calling to Your Phone

What if you could run a model that understands tool calling — the ability to translate "What's the weather in San Francisco?" into a structured API call — on a device smaller than your phone?
That's exactly what the team at Cactus Compute has achieved with Needle, a 26-million-parameter "Simple Attention Network" that distills Gemini 3.1's tool-calling abilities into a footprint so small it runs on laptops, phones, and even smart glasses.
The Big Story: Tool Calling at 14 MB
Needle isn't just another small model. It represents a fundamentally different approach to making AI useful on consumer hardware. The team distilled Gemini 3.1's function-calling ability into a specialized architecture — a Simple Attention Network — that achieves:
- 6,000 tokens/sec prefill speed in production on Cactus infrastructure
- 1,200 tokens/sec decode — fast enough for real-time interactions
- INT4 quantized weights at just 14 MB — smaller than most images on the web
- Full trainability on a Mac or PC — you can finetune it on your own tools
The model was pretrained on 16 TPU v6e chips for 200 billion tokens (27 hours), then post-trained on 2 billion tokens of single-shot function-call data (45 minutes). The weights are fully open on Hugging Face, along with the dataset generation pipeline.
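The reported numbers are easy to sanity-check with back-of-the-envelope arithmetic (assuming a standard 4-bits-per-weight INT4 layout; the small gap to 14 MB is plausibly quantization scales and metadata):

```python
# Back-of-the-envelope checks on the reported figures.
params = 26_000_000              # 26M parameters
int4_bytes = params * 4 / 8      # INT4 = 4 bits per weight
print(int4_bytes / 1e6)          # -> 13.0 MB, close to the quoted 14 MB
                                 # (remainder: scales/zero-points, metadata)

tokens = 200e9                   # pretraining tokens
hours = 27
print(tokens / (hours * 3600))   # ~2.06M tokens/sec aggregate across 16 TPUs
```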
What makes Needle particularly compelling is its focused design. Rather than being a general-purpose chatbot, it's purpose-built for one thing: taking a natural language query plus a set of tools, and outputting the correct structured function call. In benchmarks, it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM 2.5-350M on single-shot function calling for personal AI use cases.
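For readers new to the task: single-shot function calling takes a set of tool schemas plus a query and emits exactly one structured call. The schema and output format below are illustrative, in the JSON-schema style most APIs use — Needle's exact prompt template may differ, so check the repo:

```python
import json

# Illustrative tool definition (not Needle's exact prompt format).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}]

query = "What's the weather in San Francisco?"

# The model's entire job: given `query` and `tools`, emit something like this.
model_output = '{"name": "get_weather", "arguments": {"city": "San Francisco"}}'

call = json.loads(model_output)
assert call["name"] in {t["name"] for t in tools}
print(call["arguments"])  # -> {'city': 'San Francisco'}
```

Because the output space is this narrow, a tiny specialist can compete with generalist models many times its size.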
Why This Matters: On-Device AI Is Getting Real
The implications for edge AI are significant. For months, the narrative around "AI on device" has been dominated by large models getting smaller — but most remain in the 1B–8B parameter range, still too heavy for truly portable scenarios.
Needle shows there's another path: instead of shrinking a generalist model, build a specialist from scratch. A 26M model that nails one critical capability — tool calling — could be the key to unlocking genuinely useful AI assistants that run entirely on your phone, watch, or AR glasses, with no cloud dependency or latency.
As one Hacker News commenter noted, this opens the door to "building something like a command line program where you can optionally just specify the arguments in natural language." Think "toolcli add tom to teamfutz group" being parsed directly on-device with no internet connection.
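That CLI idea can be sketched in a few lines. Everything below is hypothetical: `run_model` stands in for whatever local inference call your runtime exposes (a real Needle integration would use its own API), and the tool schema is invented for illustration.

```python
import json

def dispatch_nl_command(query, tools, run_model):
    """Parse a natural-language CLI string into one structured tool call.

    `run_model` is a stand-in for a local inference call (hypothetical);
    it is expected to return a JSON function call as a string.
    """
    call = json.loads(run_model(query=query, tools=tools))
    if call["name"] not in {t["name"] for t in tools}:
        raise ValueError(f"model chose unknown tool: {call['name']}")
    return call

# Invented tool schema matching the example command from the discussion.
tools = [{"name": "add_user_to_group",
          "parameters": {"user": "string", "group": "string"}}]

# Fake model for demonstration; on a phone this would be local inference.
def fake_model(query, tools):
    return ('{"name": "add_user_to_group", '
            '"arguments": {"user": "tom", "group": "teamfutz"}}')

call = dispatch_nl_command("add tom to teamfutz group", tools, fake_model)
print(call["arguments"])  # -> {'user': 'tom', 'group': 'teamfutz'}
```

The validation step matters: even a well-trained small model can hallucinate a tool name, so checking the call against the registered schemas before executing it is cheap insurance.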
Claude Goes to Law School
In a parallel story from the enterprise AI front, Anthropic announced this week that Claude can now connect to a range of legal industry tools, including DocuSign, Box, Thomson Reuters, and Harvey. Lawyers can use Claude to review contracts, surface case law, and draft documents directly within their existing workflow tools.
This is a significant expansion for Anthropic's enterprise strategy. The legal industry has been one of the most cautious adopters of generative AI due to confidentiality concerns and accuracy requirements. By integrating directly with tools lawyers already use, rather than asking them to adopt a new interface, Claude positions itself as an assistant that augments rather than disrupts.
OpenAI's Safety Committee Confirms Model Delays
The Verge's live coverage of the Musk v. Altman trial also brought a notable revelation: Dr. Jeremy "Zico" Kolter, chair of OpenAI's safety and security committee, confirmed that the committee has formally requested delays of model releases on two occasions so far.
Kolter detailed OpenAI's layered safety structure: the Safety Systems team (guardrails and evaluations), the Preparedness team (which maintains the OpenAI Preparedness Framework), the Alignment team (alignment with human values), the Model Policy team (which owns the model spec), and investigative teams. In total, he said, roughly 200 people work on safety.
The disclosure of formal model delays adds weight to the ongoing debate about whether AI companies have sufficient internal checks and balances — and whether those checks survive competitive pressure.
Your Takeaway This Week
Three stories, one theme: AI is getting both more capable and more specialized.
- Needle proves that distillation isn't just about bigger models getting smaller — it's about creating entirely new model families optimized for specific, high-value tasks.
- Claude for legal shows that the next growth frontier isn't consumer chatbots but deeply integrated enterprise assistants.
- OpenAI's safety delays remind us that even as models improve, the governance infrastructure around them is still being built in real time.
Sources:
- Cactus Compute / Needle: github.com/cactus-compute/needle
- Hacker News Discussion: news.ycombinator.com/item?id=48111896
- Anthropic Claude for Legal: claude.com/blog/claude-for-the-legal-industry
- The Verge / OpenAI Trial Safety Coverage: theverge.com/ai-artificial-intelligence
Recommended AI tools
- Perplexity (Search & Discovery): Clear answers from reliable sources, powered by AI.
- Cursor (Code Assistance): The AI code editor that understands your entire codebase.
- Google Cloud Vertex AI (Data Analytics): Gemini, Vertex AI, and AI infrastructure—everything you need to build and scale enterprise AI on Google Cloud.
- Adobe Firefly (Image Generation): Create your way with Adobe Firefly—AI for every creative vision.
- Google AI Studio (Productivity & Collaboration): The fastest way to build AI-first applications with Google Gemini.
- Hugging Face (Scientific Research): Democratizing good machine learning, one commit at a time.