Ollama with Intel GPU acceleration on Proxmox
Run Ollama with Intel GPU acceleration on Proxmox by exposing the iGPU to a Linux container/VM and enabling Intel’s oneAPI/SYCL or OpenVINO backends for model offloading to the Xe iGPU. Below is a clean, step‑by‑step path using a privileged Ubuntu LXC with GPU device mapping, plus an optional OpenVINO backend route if higher throughput is desired on Intel hardware.
What you’ll build
- A Proxmox host with IOMMU and i915 GuC enabled for Alder Lake Xe graphics, exposing /dev/dri to guests for GPU compute access.
- A privileged Ubuntu LXC that gets direct access to the iGPU render node, where Intel oneAPI and Ollama run with GPU offload via SYCL (IPEX‑LLM) or OpenVINO.
- Optional: a full VM with PCI passthrough if containers are not preferred, though Intel iGPU passthrough is fussier on Alder Lake, especially on Windows.
Step 1: Prep Proxmox host (IOMMU + i915 GuC)
- In BIOS/UEFI, enable VT‑d (Intel IOMMU) for device assignment; this is required for clean device mapping/passthrough.
- Enable IOMMU on Proxmox and update boot config:
- Edit GRUB:
sudo nano /etc/default/grub
# Ensure this includes:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
Then apply and reboot:
sudo update-grub
sudo reboot
Validate after reboot:
dmesg | grep -e DMAR -e IOMMU
(You should see IOMMU/DMAR enabled messages.)[^1][^4]
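If you also want to see how the IOMMU grouped your devices (useful before any passthrough decisions), a short host‑side loop over sysfs is a common pattern; this is a generic check, not specific to this guide:
# List each IOMMU group and the PCI devices it contains (host shell)
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done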
- Enable GuC for Intel i915 (better scheduling/perf on Xe iGPU):
echo "options i915 enable_guc=3" | sudo tee /etc/modprobe.d/i915.conf
sudo update-initramfs -u -k all
sudo reboot
After reboot, confirm the render node exists:
ls -l /dev/dri
# Expect renderD128 at minimum
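To double‑check that GuC actually loaded, the kernel log is the quickest place to look; exact wording varies by kernel version:
# Host shell: GuC/HuC firmware status
sudo dmesg | grep -i -e guc -e huc
# Healthy output typically reports the GuC firmware version and submission enabled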
Step 2: Create a privileged Ubuntu LXC and map the GPU
- Create a privileged Ubuntu 22.04/24.04 LXC in Proxmox (privileged avoids extra uid/gid mapping hurdles with GPU devices).
- Add these lines to the container’s config at /etc/pve/lxc/<CTID>.conf (substitute your container ID) to pass the render node and allow access:
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
Start the container and verify /dev/dri/renderD128 exists inside it (see the quick check after this list).
- Note: this also works for unprivileged containers with the correct device rules and uid/gid mapping, but privileged is simpler; if you stay unprivileged, follow a known working pattern for passing the iGPU into an unprivileged LXC.
- Alternative (optional): use a full VM with PCI passthrough of the iGPU; this can work well for Linux guests, but Windows guests on Alder Lake often hit Code 43 errors without careful tuning.
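A minimal host‑side check that the mapping works, assuming <CTID> is your container ID:
pct start <CTID>
pct exec <CTID> -- ls -l /dev/dri
# Expect renderD128 (major/minor 226:128) inside the container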
Step 3: Install Intel runtimes and Ollama (inside the LXC)
- Install the Intel oneAPI Base runtime and prerequisites, then use Intel’s IPEX‑LLM integration to set up an Ollama build with Intel GPU offload via SYCL/Level Zero.
- In Ubuntu LXC:
sudo apt update
sudo apt install -y python3.11-venv
python3.11 -m venv ~/llm_env
source ~/llm_env/bin/activate
pip install --pre --upgrade "ipex-llm[cpp]"  # quote so the [cpp] extra survives shell globbing
mkdir -p ~/llama-cpp && cd ~/llama-cpp
# Initialize Ollama with Intel GPU support via IPEX-LLM helper
init-ollama
Intel’s guide uses a Python venv, installs ipex-llm[cpp], then initializes an Ollama build configured for Intel GPUs.
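If initialization succeeded, the working directory should now contain the symlinked Ollama binary; exact file names can vary across ipex-llm releases:
ls -l ~/llama-cpp
# Expect an ollama symlink (plus supporting files) created by init-ollama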
- Optional sanity check: ensure the container can see the Intel GPU compute stack by confirming OpenCL exposure if installed; many users validate with clinfo after installing Intel’s OpenCL runtime.
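For that OpenCL check, the stock Ubuntu packages are usually sufficient; package names below assume Ubuntu 22.04/24.04:
sudo apt install -y clinfo intel-opencl-icd
clinfo | grep -i -e 'platform name' -e 'device name'
# An Intel Graphics entry here confirms the container can reach the GPU compute stack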
Step 4: Run Ollama accelerated on the Intel iGPU
- Set recommended environment variables for full layer offload and Level Zero behavior, then start the service:
# From inside the LXC (activate venv if needed)
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
# If oneAPI setvars is installed/system-wide, source it:
source /opt/intel/oneapi/setvars.sh || true
# For certain kernels/GPUs this can help:
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# Optional: listen on all interfaces if exposing to LAN
# export OLLAMA_HOST=0.0.0.0
# init-ollama placed the ollama symlink in ~/llama-cpp, so launch it from there
cd ~/llama-cpp
./ollama serve
OLLAMA_NUM_GPU=999 forces all layers that can run on the GPU to offload, while the Level Zero variables improve device telemetry and command submission for Intel GPUs.
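To keep these settings reproducible across restarts, one option is a small wrapper script built from the exports above (the path ~/run-ollama.sh is just an example):
cat > ~/run-ollama.sh << 'EOF'
#!/usr/bin/env bash
# Environment from Step 4: full GPU offload plus Level Zero behavior
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
# export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # optional, per Step 4
source /opt/intel/oneapi/setvars.sh >/dev/null 2>&1 || true
cd ~/llama-cpp && exec ./ollama serve
EOF
chmod +x ~/run-ollama.sh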
- In a second shell, pull and run a model, for example:
cd ~/llama-cpp
./ollama run llama3.1:8b
The IPEX‑LLM integration steers the underlying llama.cpp execution toward Intel’s SYCL/Level Zero path when possible.
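To confirm the server is answering and actually exercising the model, a quick request against the standard Ollama HTTP API works from any shell in the container; adjust the model name to whatever you pulled:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in five words.",
  "stream": false
}'
# While this runs, the serve log should show SYCL/Level Zero device activity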
Option B: OpenVINO backend for Ollama (higher throughput path)
- Intel’s OpenVINO integration can accelerate inference on Intel CPU/iGPU/NPU and offers a dedicated backend for Ollama via OpenVINO GenAI, which can outperform generic SYCL paths in many cases.
- High‑level flow in the LXC or VM:
1) Download and initialize the OpenVINO GenAI runtime; set GODEBUG=cgocheck=0 for the Ollama executable using this backend.
2) Obtain an OpenVINO IR model (e.g., quantized DeepSeek‑R1‑Distill‑Qwen‑7B int4), then package it.
3) Write a Modelfile declaring ModelType "OpenVINO" and InferDevice "GPU", and create the Ollama model image.
- Practical commands (example flow shown by the OpenVINO team and contributors):
# 1) Prepare OpenVINO GenAI runtime (env example)
export GODEBUG=cgocheck=0
# Source the OpenVINO/GenAI setup if provided by your runtime package
# source setupvars.sh
# 2) Download an OpenVINO IR model (example uses ModelScope tools)
pip install modelscope
modelscope download --model zhaohb/DeepSeek-R1-Distill-Qwen-7B-int4-ov --local_dir ./DeepSeek-R1-Distill-Qwen-7B-int4-ov
tar -zcvf DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz DeepSeek-R1-Distill-Qwen-7B-int4-ov
# 3) Modelfile (OpenVINO backend)
cat > Modelfile << 'EOF'
FROM DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz
ModelType "OpenVINO"
InferDevice "GPU"
# Stop strings below assume DeepSeek-R1's chat template; verify against your model's tokenizer
PARAMETER stop "<|User|>"
PARAMETER stop "<|Assistant|>"
PARAMETER stop "<|end_of_sentence|>"
PARAMETER max_new_token 4096
PARAMETER stop_id 151643
PARAMETER stop_id 151647
PARAMETER repeat_penalty 1.5
PARAMETER top_p 0.95
PARAMETER top_k 50
PARAMETER temperature 0.8
EOF
# Create and run the model with the OpenVINO backend
ollama create DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 -f Modelfile
ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1
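After the create step, the new image should appear in the local model list:
ollama list
# DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 should be listed if packaging succeeded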
Troubleshooting and tips
- If /dev/dri/renderD128 is missing inside the container, recheck the LXC config lines and confirm that i915 created the render node on the host after enabling GuC (see the quick checks after this list).
- If choosing a full VM instead of LXC, Linux guests are typically smoother than Windows on Alder Lake; Windows guests often hit Code 43 unless carefully configured.
- Expect meaningful gains from OpenVINO on Intel iGPU versus generic SYCL; community benchmarks have reported several‑fold speedups on some 7B models, though results vary by model and kernel.
- Vulkan on Intel iGPU often underperforms versus CPU or Intel’s native stacks; prefer oneAPI/SYCL or OpenVINO backends on Intel hardware.
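Quick host‑side checks for the missing render node case called out above (<CTID> is your container ID):
ls -l /dev/dri                       # render node present on the host?
lsmod | grep i915                    # i915 driver loaded?
sudo dmesg | grep -i -e i915 -e guc  # driver/firmware errors during init?
cat /etc/pve/lxc/<CTID>.conf         # device allow + mount entries intact?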
Why this layout
- LXC with /dev/dri mapping is the simplest, most stable way to put the iGPU to work without full PCI passthrough, and is widely used for GPU‑accelerated services on Proxmox.
- Intel’s documented IPEX‑LLM environment for Ollama is straightforward and stays close to upstream Ollama usage, while OpenVINO offers an alternative backend with strong hardware‑specific optimizations on Intel platforms.