How Ollama Stores and Runs Models Locally
If you use Ollama to run AI models locally, you might wonder what’s actually happening under the hood. Where are the files stored? What do they contain? And what happens when you run a model?
This post breaks it down.
Where Ollama Stores Its Files
On macOS/Linux, Ollama stores its data under ~/.ollama/. Here’s a peek inside:
tree ~/.ollama/
~/.ollama/
├── history
├── id_ed25519
├── id_ed25519.pub
├── logs/
│   ├── app-*.log
│   └── server-*.log
└── models/
    ├── blobs/
    │   ├── sha256-...
    │   └── sha256-...
    └── manifests/
        └── registry.ollama.ai/library/
            ├── llama3/
            │   └── latest
            ├── qwen3/
            │   └── 8b
            └── deepseek-r1/
                └── 8b
- manifests/ → tiny JSON files (indexes) describing each model: metadata + which blobs to use.
- blobs/ → large binary chunks (the actual model weights), content-addressed by SHA-256 hash.
Together, these two pieces are the model.
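If you want to sanity-check this layout on your own machine, or keep the weights on a bigger disk, here’s a quick sketch. OLLAMA_MODELS is the environment variable Ollama reads for an alternate models directory; the /mnt path below is just an example:

du -sh ~/.ollama/models/blobs/   # the blobs are where nearly all the disk space goes

# Keep models on a different disk; the variable must be set for the
# server process (ollama serve), not just the CLI.
export OLLAMA_MODELS=/mnt/big-disk/ollama-models
ollama serve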
Inspecting a Model
You can get model details with:
ollama show llama3
Which outputs something like:
  Model
    architecture        llama
    parameters          8.0B
    context length      8192
    embedding length    4096
    quantization        Q4_0

  Capabilities
    completion

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Release Date: April 18, 2024
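For a closer look, recent versions of the CLI can also print the underlying Modelfile and list everything installed locally (flag names may vary slightly between versions):

ollama list                       # installed models, tags, and on-disk sizes
ollama show llama3 --modelfile    # the Modelfile this model was built from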
A Sample Manifest
Here’s what a manifest JSON looks like:
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:3f8eb4da...",
    "size": 485
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:6a0746a1...",
      "size": 4661211424
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:4fa551d4...",
      "size": 12403
    }
  ]
}
Think of the manifest as the bill of materials and the blobs as the actual parts.
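You can see the correspondence yourself by walking a manifest and resolving each digest to a file in blobs/. A small sketch, assuming jq is installed and using the llama3 manifest path from the tree above (on disk, the sha256: prefix in the digest becomes sha256- in the blob filename):

MANIFEST=~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest

# Print every digest the manifest references, then show the matching blob file.
jq -r '.config.digest, .layers[].digest' "$MANIFEST" \
  | sed 's/sha256:/sha256-/' \
  | while read -r blob; do
      ls -lh ~/.ollama/models/blobs/"$blob"
    done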
What Happens When You Run a Model
Running a model is a multi-step process:
1. Resolving the Model
When you call:
ollama run llama3:8b
Ollama resolves the tag (llama3:8b) to a manifest. If the manifest or blobs are missing locally, Ollama fetches them from the registry.
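The fetch can also be triggered on its own, and because the registry speaks a Docker-registry-style API (note the mediaTypes in the manifest above), the manifest can be requested directly. The URL layout below is inferred from those mediaTypes, so treat it as an illustration rather than a documented endpoint:

# Pre-fetch the manifest and blobs without starting a session.
ollama pull llama3:8b

# Peek at the manifest the registry serves (endpoint layout inferred).
curl -s -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  https://registry.ollama.ai/v2/library/llama3/manifests/8b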
2. Mapping Blobs
The manifest lists which blobs (weights) are required. Missing blobs are downloaded and stored under models/blobs/. Deduplication is automatic since everything is hash-addressed.
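Deduplication is easy to observe: if two tags share a weights layer, both manifests reference the same digest, and that blob exists only once on disk. A quick check, reusing the model-layer digest from the sample manifest above:

# Every manifest that references this layer shares a single blob in blobs/.
grep -rl "sha256:6a0746a1" ~/.ollama/models/manifests/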
3. Loading into Memory
Ollama’s backend (based on llama.cpp) loads the blobs into memory:
- GPU VRAM if you have a supported GPU and enough capacity.
- System RAM otherwise, with slower performance.
By default, blobs are memory-mapped (mmap), so only the needed parts get paged into RAM/VRAM. You can disable mmap to force a full load.
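Two ways to poke at this: ollama ps reports where a loaded model ended up, and mmap can be toggled per request through the REST API’s options field. The use_mmap option is passed through to llama.cpp in the versions I’ve checked, but availability may vary, so verify against your version’s API docs:

# Where did the model land? The PROCESSOR column shows the GPU/CPU split.
ollama ps

# Force a full load by disabling mmap for one request (option assumed to be
# passed through to llama.cpp; check your version).
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "hello",
  "options": { "use_mmap": false }
}'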
4. Runtime Caches
As the model generates text, it builds a KV cache (key/value tensors of past tokens). This cache grows with context length and is usually the biggest extra memory consumer. Ollama supports quantizing the KV cache to save VRAM.
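The two knobs here are the context window and the cache precision. As of recent releases, KV cache quantization is controlled with server-side environment variables and also requires flash attention; the names below come from the Ollama FAQ and may change between versions:

# Quantize the KV cache to roughly halve (q8_0) or quarter (q4_0) its size.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # default is f16
ollama serve

# The cache also scales with the context window you request.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'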
5. The Inference Loop
For each token:
- Prompt → tokens → embeddings
- Transformer layers → attention + MLP
- Next-token probabilities → sample
- KV cache updated → repeat
Through this loop, the weights stay loaded while the KV cache grows. Performance depends heavily on VRAM, quantization, and context size.
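You can watch this loop from the outside: the generate endpoint streams its response incrementally, so each JSON chunk you see corresponds to tokens produced by passes through the steps above:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?"
}'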
TL;DR
- Manifests = JSON indexes (model metadata + blob list).
- Blobs = actual weights (huge binary files).
- Ollama memory-maps blobs → loads into VRAM/RAM.
- KV cache grows during inference, making VRAM the key bottleneck.
So, when you run ollama run llama3, you’re essentially:
- Resolving a manifest
- Ensuring blobs are present
- Mapping weights into memory
- Running the inference loop with a growing KV cache
That’s Ollama’s model storage & execution in a nutshell.