yous.dev

How to deploy a HuggingFace model with Ollama

A practical guide to taking an open-weight model from a HuggingFace repository to a local Ollama endpoint, ready to consume over API or CLI.

[Figure: Ollama, a local inference runtime. Caption: Ollama runs open-weight models locally.]

1. Background concepts

1.1 What is Ollama?

Ollama is a local inference runtime built on top of llama.cpp. It handles:

  • Loading quantized models in GGUF format.
  • Exposing an HTTP API (port 11434 by default) with endpoints like /api/generate, /api/chat and /api/embeddings, plus OpenAI-compatible routes under /v1.
  • Managing prompt templates, sampling parameters, and model download/storage.

It installs as a service (systemd on Linux, a native app on macOS/Windows) and ships a CLI (ollama) that wraps the API.

1.2 What is GGUF?

GGUF (GPT-Generated Unified Format) is the binary format llama.cpp uses to store quantized weights, the tokenizer and model metadata in a single file. It's the format Ollama consumes directly.

1.3 Common quantizations

Label           Bits/weight (approx.)   Quality                   Typical use
BF16 / F16      16                      Maximum (reference)       GPU with comfortable VRAM
Q8_0            8                       Almost identical to F16   Good quality/size balance
Q6_K            ~6.5                    Very high                 Machines with less VRAM
Q5_K_M          ~5.5                    High                      Fast inference
Q4_K_M          ~4.8                    Acceptable                Large models on modest hardware
Q3_K_* / Q2_K   2–3                     Degraded                  Only when there's no alternative

Rule of thumb: if the quantized model fits in VRAM, default to Q8_0 or Q6_K. Drop to Q4 only when the model's size justifies it.
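
To sanity-check whether a given quantization will fit, you can estimate the file size as parameters × bits per weight. This is a rough sketch, not an exact formula: it assumes the approximate bits/weight figures from the table above and ignores metadata and any tensors kept at higher precision, so real files come out somewhat larger. The function name is illustrative.

```python
def estimate_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough GGUF size in GB (10^9 bytes): parameters x bits / 8."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(total_bytes / 1e9, 2)


# Approximate bits/weight, copied from the table above (assumed values).
BITS_PER_WEIGHT = {"F16": 16, "Q8_0": 8, "Q6_K": 6.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8}

for label, bits in BITS_PER_WEIGHT.items():
    print(f"7B {label:7s} ~{estimate_gguf_size_gb(7, bits)} GB")
```

Compare the estimate against the sizes HuggingFace lists for the actual files before committing to a download; the listing is the number that counts.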

2. Requirements

  • A server with Ollama installed and the service running (systemctl status ollama).
  • Enough disk space: roughly twice the size of the GGUF while ollama create runs, because the file is copied into Ollama's blob store.
  • Internet access to download from HuggingFace.
  • Optional: an NVIDIA GPU with drivers + CUDA. Ollama detects and uses the GPU automatically.

Quick checks:

ollama --version
ollama list                  # models already registered
curl http://localhost:11434  # should respond "Ollama is running"
df -h                        # available space

3. Download the GGUF from HuggingFace

3.1 Locate the file

On the model page, in the Files and versions tab, look for the .gguf files. You'll see something like:

my-model-7B-BF16.gguf       14 GB
my-model-7B-Q8_0.gguf        7.5 GB   ← recommended
my-model-7B-Q6_K.gguf        5.8 GB
my-model-7B-Q4_K_M.gguf      4.4 GB

Each file exposes a resolve URL that serves the binary directly:

https://huggingface.co/<user>/<repo>/resolve/main/<file>.gguf
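
If you script downloads, the same URL pattern can be assembled programmatically. A minimal sketch, assuming the pattern above with main as the default revision; the helper name and its parameters are mine, not part of any HuggingFace library.

```python
from urllib.parse import quote


def hf_resolve_url(user: str, repo: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a HuggingFace repo."""
    # quote() percent-encodes characters that are not URL-safe. It keeps "/"
    # by default, so a filename inside a subfolder stays a valid path.
    return (
        "https://huggingface.co/"
        f"{quote(user)}/{quote(repo)}/resolve/{quote(revision)}/{quote(filename)}"
    )


print(hf_resolve_url("user", "repo", "my-model-7B-Q8_0.gguf"))
```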

3.2 Direct download with wget

mkdir -p ~/models/my-model-7B
cd ~/models/my-model-7B

wget -c \
  -O my-model-7B-Q8_0.gguf \
  "https://huggingface.co/user/repo/resolve/main/my-model-7B-Q8_0.gguf?download=true"

-c lets you resume if the download breaks. For private repos add --header="Authorization: Bearer hf_xxxxxxxx".

3.3 Alternative: huggingface-cli

pip install -U huggingface_hub
huggingface-cli download user/repo my-model-7B-Q8_0.gguf \
  --local-dir ~/models/my-model-7B

3.4 What if the repo has no GGUF?

Some repos only publish the original weights (HuggingFace Transformers .safetensors). You have to convert and quantize to GGUF. The cleanest way is to use the llama.cpp Docker image so you don't pollute the host:

# Convert HF -> GGUF F16
docker run --rm -v ~/models:/models ghcr.io/ggml-org/llama.cpp:full \
  --convert /models/my-model-hf-src \
  --outfile /models/my-model-7B/my-model-7B-F16.gguf \
  --outtype f16

# Quantize F16 -> Q8_0
docker run --rm -v ~/models:/models ghcr.io/ggml-org/llama.cpp:full \
  --quantize /models/my-model-7B/my-model-7B-F16.gguf \
            /models/my-model-7B/my-model-7B-Q8_0.gguf Q8_0

Because the container runs with --rm, it is removed automatically when it exits and nothing is left behind outside the mounted directory.

4. The Modelfile

A Modelfile is the "recipe" Ollama uses to register a model. It defines where to get the weights, which chat template to apply, which default parameters to set, and which system prompt to ship pre-loaded. It's the conceptual equivalent of a Dockerfile, but for models.

4.1 Structure

# (1) Weights source
FROM /absolute/path/to/model.gguf

# (2) Prompt template
TEMPLATE """..."""

# (3) Sampling and context parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"

# (4) Default system prompt (optional)
SYSTEM """You are a helpful, concise assistant."""

# (5) LoRA adapters (optional)
# ADAPTER /path/to/adapter.gguf

# (6) Embedded license (optional)
# LICENSE """..."""
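
When you maintain several variants of the same model, it can help to generate the Modelfile instead of editing it by hand. The sketch below only assembles text in the layout shown above; render_modelfile and its arguments are hypothetical names, and it covers FROM, TEMPLATE, PARAMETER and SYSTEM only.

```python
def render_modelfile(gguf_path, template, parameters, system=None):
    """Return Modelfile text for the given weights, template and parameters."""
    lines = [f"FROM {gguf_path}", "", f'TEMPLATE """{template}"""', ""]
    for name, value in parameters.items():
        # A list value becomes one PARAMETER line per item (useful for stop).
        values = value if isinstance(value, list) else [value]
        for item in values:
            rendered = f'"{item}"' if isinstance(item, str) else str(item)
            lines.append(f"PARAMETER {name} {rendered}")
    if system:
        lines += ["", f'SYSTEM """{system}"""']
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "/absolute/path/to/model.gguf",
    "{{ .Prompt }}",
    {"temperature": 0.7, "num_ctx": 8192, "stop": ["<|im_end|>"]},
    system="You are a helpful, concise assistant.",
)
print(text)
```

Write the result to a file and pass it to ollama create -f as usual.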

4.2 Relevant directives

Directive           Function
FROM                Path to the GGUF, or the name of another model already in Ollama (FROM llama3:8b).
TEMPLATE            Go template that wraps the prompt with special tokens.
RENDERER / PARSER   (Modern Ollama) Native renderers for known architectures; avoids writing the template by hand.
PARAMETER           Tunes sampling (temperature, top_p, top_k, min_p, repeat_penalty), context (num_ctx), and stop tokens (stop).
SYSTEM              Default system prompt if the request doesn't provide one.
ADAPTER             Applies a LoRA on top of the base model.
MESSAGE             Adds few-shot examples the model sees at the start.

5. Prompt templates: why they matter

Each model family expects its messages wrapped in specific special tokens. If the template doesn't match:

  • The model doesn't know where each turn starts/ends → erratic responses.
  • It generates the wrong end token → it never stops, or stops mid-sentence.
  • It ignores the system prompt → it doesn't respect the role.

Here are the most widespread formats.

5.1 ChatML (Qwen, OpenAI style)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant

Typical Modelfile:

FROM /models/qwen-7b-q8_0.gguf

TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

5.2 Llama 3

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Modelfile fragment:

TEMPLATE """<|begin_of_text|>{{- if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
PARAMETER stop "<|eot_id|>"

5.3 Mistral / Mixtral ([INST] format)

<s>[INST] You are helpful.

Hi [/INST]

Modelfile fragment:

TEMPLATE """[INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }} [/INST]
"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"

5.4 Gemma

<start_of_turn>user
Hi<end_of_turn>
<start_of_turn>model

Modelfile fragment:

TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"

5.5 Phi-3

<|system|>
You are helpful.<|end|>
<|user|>
Hi<|end|>
<|assistant|>
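
The formats in 5.1 to 5.5 differ only in their delimiters, which is easier to see side by side. The sketch below renders a single-turn prompt for each family; it mirrors the TEMPLATE blocks in 5.1 to 5.4 and the Phi-3 example in 5.5, it is not taken from any official chat template, and the family keys are names I made up for this example.

```python
def render_prompt(family: str, user: str, system: str = "") -> str:
    """Render a single-turn prompt in the wire format of the given family."""
    if family == "chatml":
        head = f"<|im_start|>system\n{system}<|im_end|>\n" if system else ""
        return f"{head}<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
    if family == "llama3":
        head = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
            if system
            else ""
        )
        return (
            f"<|begin_of_text|>{head}<|start_header_id|>user<|end_header_id|>\n\n"
            f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    if family == "mistral":
        # No system role: the system text is folded into the user turn.
        head = f"{system}\n\n" if system else ""
        return f"[INST] {head}{user} [/INST]\n"
    if family == "gemma":
        # Same folding as Mistral, with Gemma's turn markers.
        head = f"{system}\n\n" if system else ""
        return f"<start_of_turn>user\n{head}{user}<end_of_turn>\n<start_of_turn>model\n"
    if family == "phi3":
        head = f"<|system|>\n{system}<|end|>\n" if system else ""
        return f"{head}<|user|>\n{user}<|end|>\n<|assistant|>\n"
    raise ValueError(f"unknown family: {family}")


print(render_prompt("phi3", "Hi", system="You are helpful."))
```

A string built this way is handy for the raw-mode test in section 7.3, where you send the special tokens yourself.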

5.6 Native RENDERERs (a shortcut in modern Ollama)

Recent Ollama versions ship native renderers and parsers for specific architectures. Instead of writing the template, you declare:

FROM /models/model-x.gguf

TEMPLATE {{ .Prompt }}
RENDERER architecture-name
PARSER   architecture-name

PARAMETER temperature 0.7
PARAMETER top_p 0.8

The renderer injects the correct tokens, and the parser separates thinking from content in reasoning models. When an official renderer exists for the model's family, this is the most robust option and the one I recommend.

To discover which template any already-registered model uses:

ollama show --modelfile <model>

It's the best source of truth: copy and adapt from a model of the same family that already works.
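
If you want to reuse the stop tokens and sampling defaults from that output in a script, a small parser is enough. This is a sketch under a simplifying assumption: it reads the text line by line, so a line that happens to start with PARAMETER inside a multi-line TEMPLATE would be misread. The function name is hypothetical.

```python
def modelfile_parameters(modelfile_text: str) -> dict[str, list[str]]:
    """Collect PARAMETER lines from Modelfile text into {name: [values]}."""
    params: dict[str, list[str]] = {}
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].upper() == "PARAMETER":
            name, raw = parts[1], parts[2]
            # Strip one pair of surrounding double quotes, if present.
            if len(raw) >= 2 and raw[0] == raw[-1] == '"':
                raw = raw[1:-1]
            params.setdefault(name, []).append(raw)
    return params


sample = (
    "FROM /models/qwen-7b-q8_0.gguf\n"
    'PARAMETER stop "<|im_start|>"\n'
    'PARAMETER stop "<|im_end|>"\n'
    "PARAMETER temperature 0.7\n"
)
print(modelfile_parameters(sample))
```

Feed it the captured output of ollama show --modelfile <model>; values are returned as strings, so convert numbers yourself.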

6. Create the model in Ollama

With the GGUF downloaded and the Modelfile_mymodel ready:

ollama create mymodel:7b-q8_0 -f Modelfile_mymodel

What happens under the hood:

  1. Ollama copies the GGUF into its blob store (~/.ollama/models/blobs/ or /usr/share/ollama/.ollama/models/blobs/ depending on the install), named by its SHA-256.
  2. Generates additional layers with the template, parameters and system prompt.
  3. Writes a manifest tying everything under the tag mymodel:7b-q8_0.
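
To find which blob corresponds to your GGUF, you can hash the file yourself. A minimal sketch: the hashing is plain SHA-256, but the sha256-<hex> file name is my assumption about how the blob directory is laid out, so confirm it against your own install.

```python
import hashlib


def blob_name(path: str, chunk_size: int = 1 << 20) -> str:
    """Return 'sha256-<hex digest>' for the file at path, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so a multi-GB GGUF never sits in RAM.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256-{digest.hexdigest()}"


# Usage (path is an example):
#   blob_name("/home/me/models/my-model-7B/my-model-7B-Q8_0.gguf")
```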

Important: the copy temporarily doubles disk usage. Once created, you can delete the original GGUF (Ollama already has its copy in the blob store).
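
Before running ollama create on a large file, it is worth checking that the blob store's disk can hold the copy. A rough sketch: the 10% safety margin is an arbitrary assumption of mine, and both function names are illustrative.

```python
import os
import shutil


def has_room(gguf_bytes: int, free_bytes: int, margin: float = 1.1) -> bool:
    """True if free space covers one extra copy of the GGUF plus a margin."""
    return free_bytes >= gguf_bytes * margin


def can_create(gguf_path: str, store_dir: str) -> bool:
    """Check the filesystem that holds store_dir against the GGUF's size."""
    return has_room(os.path.getsize(gguf_path), shutil.disk_usage(store_dir).free)


# Usage (paths are examples):
#   can_create("my-model-7B-Q8_0.gguf", os.path.expanduser("~/.ollama/models"))
```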

Verify:

ollama list
ollama show mymodel:7b-q8_0

7. Test the model

7.1 Interactive CLI

ollama run mymodel:7b-q8_0
>>> Hi, how are you?

7.2 REST API (/api/chat)

curl http://localhost:11434/api/chat -d '{
  "model": "mymodel:7b-q8_0",
  "messages": [
    {"role": "system", "content": "Always answer in English."},
    {"role": "user",   "content": "Summarize what GGUF is in one sentence."}
  ],
  "stream": false
}'
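
The same call from Python needs nothing beyond the standard library. A minimal sketch, assuming the default local endpoint and a non-streaming response; the function names are mine, and error handling is left out.

```python
import json
import urllib.request


def build_chat_request(model, messages, host="http://localhost:11434"):
    """Build (but do not send) a non-streaming POST to /api/chat."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{host}/api/chat",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, host="http://localhost:11434", timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, messages, host)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["message"]["content"]


# Usage, with the Ollama service running:
#   chat("mymodel:7b-q8_0", [{"role": "user", "content": "Hi"}])
```

The generous timeout is deliberate: the first request also pays for loading the model.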

7.3 Raw generation API (/api/generate)

Useful for debugging the template — pass "raw": true and send the special tokens yourself:

curl http://localhost:11434/api/generate -d '{
  "model": "mymodel:7b-q8_0",
  "prompt": "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n",
  "raw": true,
  "stream": false
}'

7.4 Reasoning models (thinking)

Modern reasoning families (Qwen3.6, DeepSeek-R1, etc.) emit a thinking trace separate from the final content. The API exposes it in the message.thinking field:

{
  "message": {
    "role": "assistant",
    "thinking": "Let's work through it step by step...",
    "content": "The result is 42."
  }
}
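
On the client side, treat the thinking field as optional, since non-reasoning models and requests with thinking disabled won't include it. A small sketch, using the response shape shown above; split_reply is a hypothetical helper.

```python
def split_reply(response: dict) -> tuple[str, str]:
    """Return (thinking, content) from a parsed /api/chat response body."""
    message = response.get("message", {})
    # Either field may be absent, so fall back to an empty string.
    return message.get("thinking", ""), message.get("content", "")


example = {
    "message": {
        "role": "assistant",
        "thinking": "Let's work through it step by step...",
        "content": "The result is 42.",
    }
}
thinking, content = split_reply(example)
print(content)
```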

To disable the trace on requests where you only want the answer:

curl http://localhost:11434/api/chat -d '{
  "model": "mymodel:7b-q8_0",
  "messages": [{"role": "user", "content": "2+2"}],
  "think": false
}'

8. Expose the service to the network

By default Ollama listens only on 127.0.0.1. To expose it on the LAN, edit the systemd override:

sudo systemctl edit ollama

And add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload:

sudo systemctl daemon-reload
sudo systemctl restart ollama

⚠️ Ollama ships no authentication. If you expose it beyond a trusted network, put a reverse proxy (nginx/Caddy/Traefik) in front with TLS and auth (Basic, an OAuth proxy, or mTLS, whichever fits).
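
A deploy script can guard against exposing the port by accident. The sketch below is deliberately narrow: it assumes OLLAMA_HOST holds an IPv4 address, optionally followed by :port, and raises ValueError for anything else (hostnames, URLs, IPv6). The function name is illustrative.

```python
import ipaddress


def listens_beyond_loopback(ollama_host: str) -> bool:
    """True if an 'ip' or 'ip:port' value binds to more than the loopback."""
    host = ollama_host.rsplit(":", 1)[0] if ":" in ollama_host else ollama_host
    return not ipaddress.ip_address(host).is_loopback


print(listens_beyond_loopback("0.0.0.0:11434"))
print(listens_beyond_loopback("127.0.0.1:11434"))
```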

Useful variables:

Variable                   What for
OLLAMA_HOST                Listen address/port.
OLLAMA_MODELS              Blob store directory (handy to move it to another disk).
OLLAMA_KEEP_ALIVE          How long to keep the model loaded in VRAM (5m, 1h, -1 forever).
OLLAMA_NUM_PARALLEL        Concurrent requests per model.
OLLAMA_MAX_LOADED_MODELS   Simultaneous models held in memory.

9. Maintenance

ollama list                       # what's registered
ollama ps                         # what's loaded in VRAM right now
ollama show --modelfile <m>        # see a model's recipe
ollama cp old:tag new:tag         # duplicate/rename
ollama rm <model>                  # delete a model and free disk
journalctl -u ollama -f           # live logs

To iterate on the Modelfile (change the template, tweak temperature, etc.) just edit it and run ollama create again with the same tag — it overwrites.

10. Common errors and how to diagnose them

The model won't stop generating. A PARAMETER stop with the correct end token is missing, or the template doesn't close the turn with the token the model learned. Check the repo's tokenizer_config.json to confirm the eos_token.
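
Reading the eos_token out of tokenizer_config.json takes a few lines. One wrinkle, which this sketch handles: the field is sometimes a plain string and sometimes an object carrying the token under a content key. The helper name is mine.

```python
import json


def eos_token(tokenizer_config: dict):
    """Return the eos_token string from a parsed tokenizer_config.json, or None."""
    value = tokenizer_config.get("eos_token")
    if isinstance(value, dict):
        # Object form: the token text lives under "content".
        return value.get("content")
    return value


# Usage (path is an example):
#   with open("tokenizer_config.json", encoding="utf-8") as fh:
#       print(eos_token(json.load(fh)))
print(eos_token({"eos_token": "<|im_end|>"}))
```

Whatever it returns is the token that should appear in a PARAMETER stop line.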

Empty or garbled responses. The template isn't the one for the model's family. Pull the Modelfile from another model in the same family (ollama show --modelfile) and use it as a reference.

Error: invalid file magic. The GGUF is corrupt or incomplete. Resume the download with wget -c and compare the size against what HuggingFace shows.
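
You can check the magic yourself before involving Ollama: a GGUF file starts with the four bytes GGUF followed by a little-endian 32-bit version number. A minimal sketch with a made-up function name; it inspects only the header, so a truncated download with an intact header still passes, and the size comparison remains necessary.

```python
import struct


def read_gguf_header(path: str):
    """Return (magic_ok, version) from the first 8 bytes of a file."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        return False, None
    # The version is an unsigned 32-bit little-endian integer.
    (version,) = struct.unpack("<I", head[4:8])
    return True, version


# Usage (path is an example):
#   read_gguf_header("my-model-7B-Q8_0.gguf")
```

An HTML error page saved under a .gguf name, a common result of a failed authenticated download, fails this check immediately.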

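Error: model requires more system memory than is available. The model plus its context buffers doesn't fit in the memory Ollama can use. Lower the quantization, reduce num_ctx, or free up RAM. When the model only exceeds VRAM, Ollama offloads the remaining layers to the CPU (slower, but it works).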

The model loads but the GPU sits at 0%. The NVIDIA driver and the CUDA libraries are out of sync. nvidia-smi should run clean; if it reports "Driver/library version mismatch", reboot or reinstall the drivers.

Hangs on the first request. Ollama is loading the model into memory. The initial latency can be several seconds; later requests start immediately for as long as the model stays loaded.
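
To tell load time from generation time, look at the timing fields in a non-streaming response, which Ollama reports in nanoseconds. A small sketch; the helper name is mine and the sample numbers are invented for illustration, not measurements.

```python
def timing_summary(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields to seconds and tokens/s."""
    ns = 1e9
    eval_seconds = response.get("eval_duration", 0) / ns
    return {
        "load_s": response.get("load_duration", 0) / ns,
        "total_s": response.get("total_duration", 0) / ns,
        # Guard against a zero duration to avoid dividing by zero.
        "tokens_per_s": response.get("eval_count", 0) / eval_seconds if eval_seconds else 0.0,
    }


# Illustrative values only.
sample_response = {
    "total_duration": 5_000_000_000,
    "load_duration": 4_000_000_000,
    "eval_count": 50,
    "eval_duration": 500_000_000,
}
print(timing_summary(sample_response))
```

A large load_s on the first call that drops to near zero on the next one confirms the delay was model loading, not a hang.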

11. Flow summary

┌─────────────────────┐
│  HuggingFace repo   │
│  (.gguf quantized)  │
└──────────┬──────────┘
           │  wget / huggingface-cli
           ▼
┌─────────────────────┐
│  ~/models/<model>/  │
│  file.gguf          │
└──────────┬──────────┘
           │
           │  + Modelfile (TEMPLATE, PARAMETER, SYSTEM)
           ▼
┌─────────────────────┐
│  ollama create      │ ──► blob store + manifest
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  ollama run / API   │   :11434/api/chat
└─────────────────────┘

Three steps: download the GGUF, write the Modelfile, run ollama create. The rest is knowing which template your model wants and tuning the sampling parameters to the use case.
