Ollama with Intel GPU acceleration on Proxmox

Run Ollama with Intel GPU acceleration on Proxmox by exposing the iGPU to a Linux container or VM and enabling Intel's oneAPI/SYCL or OpenVINO backends so models can offload to the Xe iGPU. Below is a clean, step-by-step path using a privileged Ubuntu LXC with GPU device mapping, plus an optional OpenVINO backend route if you want higher throughput on Intel hardware.

What you’ll build

  • A Proxmox host with IOMMU and i915 GuC enabled for Alder Lake Xe graphics, exposing /dev/dri to guests for GPU compute access.
  • A privileged Ubuntu LXC that gets direct access to the iGPU render node, where Intel oneAPI and Ollama run with GPU offload via SYCL (IPEX‑LLM) or OpenVINO.
  • Optional: a full VM with PCI passthrough if containers are not preferred, though Intel iGPU passthrough is fussier on Alder Lake, especially on Windows.

Step 1: Prep the Proxmox host (IOMMU + i915 GuC)
  • In BIOS/UEFI, enable VT‑d (Intel IOMMU) for device assignment; this is required for clean device mapping/passthrough.
  • Enable IOMMU on Proxmox and update boot config:
    • Edit GRUB:
sudo nano /etc/default/grub
# Ensure this includes:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

Then apply and reboot:

sudo update-grub
sudo reboot

Validate after reboot:

dmesg | grep -e DMAR -e IOMMU

(You should see IOMMU/DMAR enabled messages.)
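
Beyond the dmesg check, the groups the IOMMU has formed can be enumerated from sysfs; every device you intend to map or pass through should appear in a group. This is a common Proxmox host snippet (empty output means the IOMMU is not active):

```shell
# List each IOMMU group and the PCI addresses of the devices it contains
find /sys/kernel/iommu_groups/ -maxdepth 1 -mindepth 1 -type d 2>/dev/null | sort -V | \
while read -r g; do
  echo "IOMMU group ${g##*/}: $(ls "$g/devices" 2>/dev/null)"
done
```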

  • Enable GuC for Intel i915 (better scheduling/perf on Xe iGPU):
echo "options i915 enable_guc=3" | sudo tee /etc/modprobe.d/i915.conf
sudo update-initramfs -u -k all
sudo reboot

After reboot, confirm the render node exists:

ls -l /dev/dri
# Expect renderD128 at minimum

Step 2: Create a privileged Ubuntu LXC and map the GPU
  • Create a privileged Ubuntu 22.04/24.04 LXC in Proxmox (privileged avoids extra uid/gid mapping hurdles with GPU devices).
  • Add these lines to the container’s config at /etc/pve/lxc/<CTID>.conf (where <CTID> is the container’s numeric ID) to pass the render node and allow device access:
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file

Start the container and verify /dev/dri/renderD128 exists inside it.

  • Note: This also works for unprivileged containers with the correct device rules and uid/gid mapping, but privileged is simpler; if you stay unprivileged, follow a known‑working pattern for passing the iGPU into an unprivileged LXC.
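
For the unprivileged route, newer Proxmox releases (8.2+) can map the device with a `dev` entry in the container config, which handles the gid mapping for you. A sketch (gid=104 is the `render` group in Debian guests; run `getent group render` inside your container to confirm the right value for your distro):

```
# /etc/pve/lxc/<CTID>.conf — unprivileged alternative to the cgroup2/bind lines
dev0: /dev/dri/renderD128,gid=104
```
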
  • Alternative (optional): Use a full VM with PCI passthrough of the iGPU; this can work well on Linux guests, but Windows guests on Alder Lake can throw Code 43 errors without careful tuning.

Step 3: Install Intel runtimes and Ollama (inside the LXC)
  • Install Intel oneAPI Base runtime and prerequisites, then use Intel’s IPEX‑LLM integration to prime Ollama for Intel GPU offload via SYCL/Level Zero.
  • In Ubuntu LXC:
sudo apt update
sudo apt install -y python3.11-venv
python3.11 -m venv ~/llm_env
source ~/llm_env/bin/activate
pip install --pre --upgrade ipex-llm[cpp]
mkdir -p ~/llama-cpp && cd ~/llama-cpp
# Initialize Ollama with Intel GPU support via IPEX-LLM helper
init-ollama

Intel’s guide uses a Python venv, installs ipex-llm[cpp], then initializes an Ollama build configured for Intel GPUs.

  • Optional sanity check: ensure the container can see the Intel GPU compute stack by confirming OpenCL exposure if installed; many users validate with clinfo after installing Intel’s OpenCL runtime.
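
With the oneAPI runtime in place, `sycl-ls` gives a similar sanity check for the SYCL stack; the iGPU should appear as a `[level_zero:gpu]` device. A sketch, assuming the default oneAPI install location:

```shell
# List the devices the SYCL runtime can see; expect a level_zero:gpu entry
source /opt/intel/oneapi/setvars.sh 2>/dev/null || true
sycl-ls || echo "sycl-ls not found - is the oneAPI runtime installed?"
```
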
Step 4: Run Ollama accelerated on the Intel iGPU
# From inside the LXC (activate venv if needed)
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
# If oneAPI setvars is installed/system-wide, source it:
source /opt/intel/oneapi/setvars.sh || true
# For certain kernels/GPUs this can help:
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# Optional: listen on all interfaces if exposing to LAN
# export OLLAMA_HOST=0.0.0.0
ollama serve

OLLAMA_NUM_GPU=999 forces all layers that can run on the GPU to offload, while the Level Zero variables improve device telemetry and command submission for Intel GPUs.
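
To avoid re-typing the exports, the environment setup can be bundled into a small launcher script. A sketch; the paths assume the venv and init-ollama directory from Step 3:

```shell
# Create a launcher that always starts the server with the same flags
cat > ~/start-ollama.sh << 'EOF'
#!/usr/bin/env bash
set -e
source ~/llm_env/bin/activate
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
source /opt/intel/oneapi/setvars.sh || true
exec ~/llama-cpp/ollama serve
EOF
chmod +x ~/start-ollama.sh
```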

  • In a second shell, pull and run a model, for example:
ollama run llama3.1:8b

The IPEX‑LLM integration steers the underlying llama.cpp execution toward Intel’s SYCL/Level Zero path when possible.
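
Once the server is up, Ollama's REST API on port 11434 gives a quick end-to-end check from any shell. A sketch; the fallback message covers the case where the server is not yet running:

```shell
# Build and validate the request payload, then POST it to the Ollama API
PAYLOAD='{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'
echo "$PAYLOAD" | python3 -m json.tool          # sanity-check the JSON first
curl -s http://localhost:11434/api/generate -d "$PAYLOAD" \
  || echo "Ollama server not reachable on localhost:11434"
```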

Option B: OpenVINO backend for Ollama (higher throughput path)
  • Intel’s OpenVINO integration can accelerate inference on Intel CPU/iGPU/NPU and offers a dedicated backend for Ollama via OpenVINO GenAI, which can outperform generic SYCL paths in many cases.
  • High‑level flow in the LXC or VM:

1) Download and initialize the OpenVINO GenAI runtime; set GODEBUG=cgocheck=0 for the Ollama executable using this backend.
2) Obtain an OpenVINO IR model (e.g., quantized DeepSeek‑R1‑Distill‑Qwen‑7B int4), then package it.
3) Write a Modelfile declaring ModelType "OpenVINO" and InferDevice "GPU," and create the Ollama model image.

  • Practical commands (example flow shown by the OpenVINO team and contributors):
# 1) Prepare OpenVINO GenAI runtime (env example)
export GODEBUG=cgocheck=0
# Source the OpenVINO/GenAI setup if provided by your runtime package
# source setupvars.sh

# 2) Download an OpenVINO IR model (example uses ModelScope tools)
pip install modelscope
modelscope download --model zhaohb/DeepSeek-R1-Distill-Qwen-7B-int4-ov --local_dir ./DeepSeek-R1-Distill-Qwen-7B-int4-ov
tar -zcvf DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz DeepSeek-R1-Distill-Qwen-7B-int4-ov

# 3) Modelfile (OpenVINO backend)
cat > Modelfile << 'EOF'
FROM DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz
ModelType "OpenVINO"
InferDevice "GPU"
PARAMETER stop "<|User|>"
PARAMETER stop "<|Assistant|>"
PARAMETER stop "<|end_of_sentence|>"
PARAMETER max_new_token 4096
PARAMETER stop_id 151643
PARAMETER stop_id 151647
PARAMETER repeat_penalty 1.5
PARAMETER top_p 0.95
PARAMETER top_k 50
PARAMETER temperature 0.8
EOF

# Create and run the model with the OpenVINO backend
ollama create DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 -f Modelfile
ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1

Troubleshooting and tips

  • If /dev/dri/renderD128 is missing inside the container, recheck the LXC config lines and that i915 has created the render node on the host after enabling GuC.
  • If choosing a full VM instead of LXC, Linux guests are typically smoother than Windows on Alder Lake; Windows guests often hit Code 43 unless carefully configured.
  • Expect meaningful gains from OpenVINO on Intel iGPU versus generic SYCL; community benchmarks have reported several‑fold speedups on some 7B models, though results vary by model and kernel.
  • Vulkan on Intel iGPU often underperforms versus CPU or Intel’s native stacks; prefer oneAPI/SYCL or OpenVINO backends on Intel hardware.
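
A quick way to confirm offload actually happened is `ollama ps`, which reports each loaded model's processor placement (requires the server from Step 4 to be running and a model loaded):

```shell
# The PROCESSOR column should read "100% GPU" (or a GPU/CPU split),
# not "100% CPU"
ollama ps || echo "ollama not running or not on PATH"
```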

Why this layout

  • LXC with /dev/dri mapping is the simplest, most stable way to put the iGPU to work without full PCI passthrough, and is widely used for GPU‑accelerated services on Proxmox.
  • Intel’s documented IPEX‑LLM environment for Ollama is straightforward and stays close to upstream Ollama usage, while OpenVINO offers an alternative backend with strong hardware‑specific optimizations on Intel platforms.

Ins0mniA