Local & Private AI
Run Models on Your Own Machine. Zero Cloud. Full Control.
TL;DR:
Local AI runs entirely on your hardware — no internet required, no data sent to the cloud. The two key tools are Ollama (command-line, developer-friendly) and LM Studio (graphical, beginner-friendly). Both use GGUF files — compressed model weights that run efficiently without a GPU. You'll learn to pick models by RAM, understand what Q4_K_M vs Q8_0 means, and build a workflow that keeps sensitive data truly private.
Why run AI locally?
Cloud AI is fast and powerful. But it comes with trade-offs that matter for many users and organizations. Here are the three core reasons to choose local:
True Privacy
Your prompts, documents, and outputs stay on your device. No server logs, no training data contribution, no third-party data processor. Critical for legal work, medical data, source code, and confidential business information.
Zero Subscription Cost
After the one-time download of a model (typically 2–40 GB), inference is free — no per-token billing. Heavy users who replace $20–$100/month in API costs typically break even within weeks.
Offline & Reliable
Works on planes, in remote locations, in air-gapped environments. No API outages, no rate limits. If your use case involves processing in sensitive networks, local is the only option.
The GGUF format: how local models are packaged
When you download a local AI model, you're downloading a GGUF file — a binary format developed by the llama.cpp project. GGUF stands for GPT-Generated Unified Format. It packs everything the inference engine needs into a single file:
Model weights
The billions of numbers that define how the model thinks, compressed using quantization.
Tokenizer
The vocabulary and rules for converting text into numbers the model processes.
Prompt template
The exact format the model expects for system prompts, user turns, and assistant turns.
Metadata
Architecture type, context window size, license, and other configuration values.
Ollama and LM Studio both use GGUF under the hood via the llama.cpp inference engine, which is optimized to run without a discrete GPU: it uses Metal acceleration on Apple Silicon and AVX2 instructions on standard x86 CPUs. You can find GGUF models on Hugging Face by searching for “GGUF” + any model name.
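Because the header layout is fixed, you can sanity-check that a download really is a GGUF file before loading it. A minimal Python sketch (the helper name `read_gguf_header` is ours; the field layout follows the llama.cpp GGUF spec: 4 magic bytes, then a little-endian uint32 version and two uint64 counts):

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic bytes b'GGUF', then a
    little-endian uint32 version, uint64 tensor count, and
    uint64 metadata key-value count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}
```

Run it against any file in your model directory: a real GGUF model reports its format version and how many tensors and metadata entries it contains, while anything else raises an error immediately.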
Quantization demystified: Q4, Q5, Q8
A full-precision AI model stores each weight as a 16-bit or 32-bit floating point number. Quantization compresses these to fewer bits — trading a small amount of accuracy for a dramatic reduction in file size and RAM usage. The format code tells you exactly what kind of compression was used.
| Format | Bits/weight | RAM for 7B model | Quality loss | Best for |
|---|---|---|---|---|
| Q4_K_M (recommended) | 4-bit (k-quant) | ~4 GB | Very small (+0.05 perplexity) | Daily use — best balance |
| Q5_K_M | 5-bit (k-quant) | ~5 GB | Minimal (+0.035 perplexity) | When Q4 feels slightly weak |
| Q6_K | 6-bit (k-quant) | ~6 GB | Near-negligible | High-quality writing tasks |
| Q8_0 | 8-bit (linear) | ~8 GB | Near full-precision | When quality is critical |
| FP16 | 16-bit (full) | ~14 GB | None (baseline) | Fine-tuning, research |
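To make the "linear" 8-bit row concrete, here is a toy version of absmax quantization in Python: one float scale per block plus one int8-range value per weight (real Q8_0 in llama.cpp works on fixed 32-weight blocks; the tiny block and function names here are illustrative):

```python
def quantize_q8_block(weights):
    """Toy Q8_0-style linear quantization of one block:
    store one float scale per block and one int8 per weight."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

w = [0.81, -1.27, 0.05, 0.33]
q, scale = quantize_q8_block(w)
w_hat = dequantize(q, scale)
# each reconstructed weight differs from the original by at most scale/2
```

The same idea scales down to 4, 5, or 6 bits: fewer levels per block means a coarser grid, which is exactly the quality-for-size trade the table describes.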
Recommended default: Q4_K_M. Its k-quant scheme groups weights into blocks with per-block scales, which distributes quantization error more evenly than the simpler legacy Q4_0 format. 2025 benchmarks show Q4_K_M performing on par with Q8_0 on math reasoning (GSM8K) and losing only marginally on instruction-following tasks. Only upgrade to Q8_0 if you notice quality issues and have the RAM to spare.
Your hardware: what can it run?
Ollama's architecture loads as many model layers as possible into GPU VRAM, then falls back to CPU for the rest. This means you can run models even without a discrete GPU — it's just slower. The limiting factor is total memory (RAM + VRAM combined).
4–8 GB RAM
Phi-4 Mini (3.8B)
~8–15 tokens/sec
8–16 GB RAM
Gemma 3 9B, Mistral 7B
~10–25 tokens/sec
16–32 GB RAM
Mistral Small 24B
~15–30 tokens/sec
40 GB+ RAM
Llama 3.3 70B, Qwen2.5 72B
GPT-4 class quality
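These tiers follow from a back-of-envelope estimate: weight memory is roughly parameters × bits per weight ÷ 8, plus headroom for the KV cache and runtime buffers. A sketch (the ~4.5 effective bits/weight for Q4_K_M and the flat 1 GB overhead are rough assumptions, not measured values):

```python
def estimate_model_ram_gb(params_billions, bits_per_weight, overhead_gb=1.0):
    """Rule of thumb: weights take params * bits / 8 bytes,
    plus a flat allowance for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# 7B at ~4.5 effective bits/weight (Q4_K_M): roughly 4.9 GB total
print(estimate_model_ram_gb(7, 4.5))
# 70B at the same quantization: roughly 40.4 GB, hence the 40 GB+ tier
print(estimate_model_ram_gb(70, 4.5))
```

The estimate also explains why a 24B model needs the 16 GB tier: 24 × 4.5 ÷ 8 is about 13.5 GB of weights before any context is loaded.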
GPU acceleration: Ollama automatically detects and uses NVIDIA CUDA (driver 531+), AMD ROCm, and Apple Metal. On Apple Silicon, the unified memory architecture means there's no separate VRAM — all RAM is available for the model, which gives MacBooks a significant advantage over similarly specced Windows laptops.
Ollama: from zero to running in 5 minutes
Ollama is an open-source tool that wraps llama.cpp in a clean CLI and local REST API (port 11434). It handles model download, storage, hardware detection, and inference — all automatically.
Install Ollama
Download from ollama.com — available for macOS, Linux, and Windows. The installer sets up a background service that starts automatically.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
Run your first model
Ollama downloads the model on first run and caches it in ~/.ollama/blobs/. Subsequent runs load from cache instantly.
ollama run phi4-mini
# Downloads ~2.5 GB on first run, then opens a chat prompt
Essential commands
Manage models and sessions from the terminal.
ollama list # Show downloaded models
ollama pull gemma3:9b # Download without running
ollama rm phi4-mini # Delete a model
ollama ps # Show currently running models
ollama show gemma3:9b # Show model info and license
Use the REST API
Ollama exposes an OpenAI-compatible API on localhost:11434. Any app that supports OpenAI can use it — just point the base URL to your local instance.
curl http://localhost:11434/api/chat -d '{
"model": "phi4-mini",
"messages": [{"role": "user", "content": "Explain GDPR in 3 sentences."}],
"stream": false
}'
Try it: Run your first local model
1. Install Ollama from ollama.com (takes ~2 minutes).
2. Open your terminal and run: ollama run phi4-mini — it downloads ~2.5 GB and opens a chat.
3. Ask it: "Summarize what quantization means in 3 sentences." Compare the answer to what you just read.
4. Run: ollama list — confirm your model is cached locally.
5. Open Activity Monitor (Mac) or Task Manager (Windows) and confirm no network traffic while the model is responding.
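The REST API from the previous section works from any HTTP client, not just curl. A minimal Python sketch using only the standard library (`chat` and `build_chat_request` are our own helper names; sending a request assumes the Ollama service is running locally with phi4-mini pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default port

def build_chat_request(model, prompt):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of chunks
    }

def chat(model, prompt):
    """POST to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# usage (requires Ollama running and the model pulled):
#   reply = chat("phi4-mini", "Explain GDPR in 3 sentences.")
```

Because the endpoint speaks plain HTTP on localhost, the same pattern works from shell scripts, editors, or any automation you already have.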
LM Studio: the graphical alternative
LM Studio is a desktop app (Windows, macOS, Linux) that wraps the same GGUF/llama.cpp stack in a graphical interface. No terminal required. It's the better starting point if command-line tools feel unfamiliar.
LM Studio strengths
• Built-in model browser (search Hugging Face directly)
• Visual chat interface with conversation history
• Drag-and-drop GGUF file import
• OpenAI-compatible local server (one toggle)
• Parameter sliders (temperature, context length)
Important caveat
• LM Studio is closed-source (unlike Ollama)
• For highly regulated environments requiring code audits, prefer open-source tools (Ollama, llama.cpp, vLLM)
• Collects minimal telemetry by default — can be disabled in settings
• Model inference itself stays fully local
Choosing the right local model
The local LLM landscape in 2026 has matured significantly. Here are the top options by use case and hardware tier, based on MMLU, HumanEval, and IFEval benchmarks:
Phi-4 Mini (3.8B)
by Microsoft · ~2.5 GB (Q4_K_M) · 128K tokens context
Best reasoning below 4 GB RAM
Phi-4 Mini punches far above its parameter count on reasoning and math. Ideal for users with limited RAM (base MacBook Air, older laptops). Context window of 128K means it handles long documents.
Gemma 3 9B
by Google DeepMind · ~6 GB (Q4_K_M) · 128K tokens context
Best quality at 8 GB RAM tier
Strong instruction-following and multilingual support. The best all-rounder for users with 8 GB RAM. Apache 2.0 licensed — fully commercial-use friendly.
Mistral Small 3.1 (24B)
by Mistral AI · ~15 GB (Q4_K_M) · 128K tokens context
Large step up in quality at 16 GB
Strong across writing, reasoning, and code. The 24B parameter count gives it notably better nuance than 7–9B models. Requires 16 GB RAM at Q4_K_M.
Qwen2.5 72B / Llama 3.3 70B
by Alibaba / Meta · ~40 GB (Q4_K_M) · 128K tokens context
GPT-4 (2023) level quality
Llama 3.3 70B scores ~82% on MMLU — comparable to early GPT-4. Qwen2.5 72B leads on coding (87% HumanEval) and multilingual tasks (29 languages). Requires a workstation, Mac Studio, or RTX 4090 with system RAM overflow.
# 4–8 GB RAM
ollama pull phi4-mini
# 8–16 GB RAM
ollama pull gemma3:9b
# 16–32 GB RAM
ollama pull mistral-small3.1
# 40+ GB RAM
ollama pull llama3.3:70b
ollama pull qwen2.5:72b
Local AI & privacy: what “local” actually guarantees
Running AI locally is not a magic privacy guarantee — but it is a meaningful one. Here's what you actually get:
What IS guaranteed
- Your prompts and outputs never leave your machine
- No third-party data processor agreement needed (GDPR Article 28 — no external processor)
- No training data contribution to the model provider
- Audit logs stay in your control
- Offline operation: no data exfiltration even if network is compromised
What is NOT guaranteed
- Protection from local malware or OS-level data access
- The model itself may contain biases or inaccuracies from its training data
- LM Studio (closed-source) collects some telemetry by default — disable in settings
- Downloaded model weights are typically not encrypted at rest
- Full GDPR compliance still requires proper data governance on your end
Try it: Audit a sensitive workflow
1. Think of one task you currently use a cloud AI tool for that involves sensitive data (client info, internal docs, personal data).
2. Open Network Monitor or Activity Monitor while running the same task with Ollama. Confirm: no outbound connections during inference.
3. Check: does your organization have a data processing agreement with your current AI provider? If not, is the use case compliant?
4. Decide: is local AI the right tool for this specific workflow, or does the quality trade-off make cloud AI the better choice?
When cloud AI beats local: honest limitations
Frontier reasoning tasks
Complex multi-step reasoning, advanced coding, and nuanced analysis still favor GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The gap is real, especially for 70B+ workloads.
Multimodal inputs
Vision (analyzing images/documents), real-time voice, and video generation require cloud or dedicated hardware. LLaVA and similar local vision models exist but trail cloud quality significantly.
Speed on large models
A 70B model at Q4_K_M on a laptop CPU might generate 3–5 tokens/second — slow for interactive use. Cloud inference is 10–50x faster for large models.
Up-to-date knowledge
Local models have a knowledge cutoff from their training date. They won't know about events after that date unless you use RAG (retrieval-augmented generation) with current documents.
The right mental model: Local AI is not “worse cloud AI.” It's a different tool with different trade-offs. Many professionals use both: local models for privacy-sensitive tasks and rough drafts, cloud models for final output quality where data is non-sensitive.
Prompt templates for local AI
Local models respond well to explicit structure. Use these templates as starting points — copy directly into Ollama or LM Studio:
You are a professional analyst. Summarize the following document for an internal audience.
Requirements:
- 3–5 bullet points maximum
- Flag any risks or open questions
- Do not add information not present in the document
- Use plain language
Document:
[PASTE DOCUMENT HERE]
You are a senior software engineer conducting a code review. Review the following code for:
1. Security vulnerabilities (SQL injection, XSS, SSRF, etc.)
2. Logic errors or edge cases not handled
3. Performance issues
4. Adherence to clean code principles
For each issue found: explain what it is, why it matters, and how to fix it.
Code to review:
[PASTE CODE HERE]
You are an executive assistant. Convert these raw meeting notes into a structured document.
Output format:
## Summary (2-3 sentences)
## Key Decisions
## Action Items (owner, due date)
## Open Questions
## Next Meeting
Raw notes:
[PASTE NOTES HERE]
Risks & Responsible Use
Know these before you go further.
Model hallucination is not eliminated by running locally
Local models hallucinate just as much as — or more than — cloud models, because they are typically smaller and less carefully tuned. A Phi-4 Mini confidently giving you a wrong legal citation is still wrong.
What this means for you
Apply the same verification habits to local AI output as you would to cloud AI: never publish facts, legal information, or medical advice without checking primary sources.
Quantization degrades quality unevenly
2025 benchmarks show that Q4 quantization hurts multilingual and instruction-following tasks more than it hurts math. A model that seems fine in English may perform significantly worse in German or French.
What this means for you
If you use local AI for non-English tasks, test quality explicitly in your target language before committing to a workflow. Consider Q5_K_M or Q8_0 for multilingual use.
Local ≠ secure if your device is compromised
If your laptop has malware, keyloggers, or is on a compromised network, local AI offers no additional protection. 'Local' means protected from the AI provider — not protected from everything.
What this means for you
Maintain good device security hygiene: full-disk encryption, OS updates, reputable endpoint protection. Local AI is a privacy tool, not a security cure-all.
Model weights downloaded from untrusted sources can contain malicious code
GGUF files are data, not programs, but the inference engine parses them directly, and parser vulnerabilities have allowed crafted files to trigger code execution. Malicious quantized models have also been distributed via unofficial channels.
What this means for you
Only download models from Ollama's official library or verified Hugging Face repositories (check the organization's verification badge and download counts). Never run GGUF files from random websites.
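One concrete safeguard: verify a downloaded file's checksum against the SHA-256 published on the model repository's files page before loading it. A small Python helper (`sha256_of_file` is our own name; it streams the file in chunks so multi-GB models don't exhaust RAM):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading 1 MB at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# compare the result against the SHA-256 shown on the model's
# Hugging Face "Files" page; any mismatch means a corrupted or
# tampered download
```

Hugging Face lists a SHA-256 for each file in a repository, so a single comparison tells you the bytes on disk are the bytes the publisher uploaded.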
Local AI tools to try
Run these tools locally — no cloud, no subscription, no data leaving your machine
Key Insights: What You've Learned
GGUF files package model weights, tokenizer, and configuration into one portable file — both Ollama and LM Studio run them via llama.cpp. Q4_K_M is the recommended quantization: ~75% size reduction with minimal quality loss. Hardware tiers: 4–8 GB for Phi-4 Mini, 8–16 GB for Gemma 3 9B, 16–32 GB for Mistral Small, 40+ GB for Llama 70B.
Ollama provides a CLI and OpenAI-compatible REST API (port 11434) ideal for developers and automation; LM Studio offers a graphical interface better suited for beginners and model exploration. Both are free and run entirely on your device.
Local AI guarantees your prompts never reach the AI provider's servers — but it is not a complete security solution and does not automatically make processing of third-party personal data GDPR-compliant. Cloud AI still wins on frontier reasoning, multimodal tasks, and the latest model capabilities.
Ready to Apply What You Learned?
AI in Practice: Mastering AI Workflows
Learn to build reliable AI workflows — local or cloud
Start Learning