# Ollama with Intel GPU acceleration on Proxmox

Run Ollama with Intel GPU acceleration on Proxmox by exposing the iGPU to a Linux container or VM and enabling Intel's oneAPI/SYCL or OpenVINO backends to offload model layers to the Xe iGPU. Below is a clean, step-by-step path using a privileged Ubuntu LXC with GPU device mapping, plus an optional OpenVINO backend route if higher throughput is desired on Intel hardware.

## What you'll build

- A Proxmox host with IOMMU and i915 GuC enabled for Alder Lake Xe graphics, exposing `/dev/dri` to guests for GPU compute access.
- A privileged Ubuntu LXC with direct access to the iGPU render node, where Intel oneAPI and Ollama run with GPU offload via SYCL (IPEX-LLM) or OpenVINO.
- Optional: a full VM with PCI passthrough if containers are not preferred, though Intel iGPU passthrough is fussier on Alder Lake, especially with Windows guests.

## Step 1: Prep the Proxmox host (IOMMU + i915 GuC)

In BIOS/UEFI, enable VT-d (Intel IOMMU) for device assignment; this is required for clean device mapping and passthrough.

Enable IOMMU on Proxmox and update the boot config. Edit GRUB:

```bash
sudo nano /etc/default/grub
# Ensure this line includes the IOMMU flags:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
```

Then apply and reboot:

```bash
sudo update-grub
sudo reboot
```

Validate after reboot; you should see IOMMU/DMAR enabled messages:[^1][^4]

```bash
dmesg | grep -e DMAR -e IOMMU
```

Enable GuC for the Intel i915 driver (better scheduling/performance on the Xe iGPU):

```bash
echo "options i915 enable_guc=3" | sudo tee /etc/modprobe.d/i915.conf
sudo update-initramfs -u -k all
sudo reboot
```

After reboot, confirm the render node exists:

```bash
ls -l /dev/dri
# Expect renderD128 at minimum
```

## Step 2: Create a privileged Ubuntu LXC and map the GPU

Create a privileged Ubuntu 22.04/24.04 LXC in Proxmox (privileged avoids extra uid/gid mapping hurdles with GPU devices). Add these lines to the container's config at `/etc/pve/lxc/<CTID>.conf` to pass the render node and allow access:

```
# DRM devices use major 226: 226:0 is card0, 226:128 is renderD128
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
```

Start the container and verify that `/dev/dri/renderD128` exists inside it.

Note: this also works for unprivileged containers with the correct device rules, but privileged is simpler; if staying unprivileged, consult a known working pattern for mapping the iGPU into an unprivileged LXC.

Alternative (optional): use a full VM with PCI passthrough of the iGPU. This can work well with Linux guests, but Windows guests on Alder Lake can throw Code 43 errors without careful tuning.

## Step 3: Install Intel runtimes and Ollama (inside the LXC)

Install the Intel oneAPI Base runtime and prerequisites, then use Intel's IPEX-LLM integration to prime Ollama for Intel GPU offload via SYCL/Level Zero. In the Ubuntu LXC:

```bash
sudo apt update
sudo apt install -y python3.11-venv
python3.11 -m venv ~/llm_env
source ~/llm_env/bin/activate
pip install --pre --upgrade "ipex-llm[cpp]"  # quoted so the shell doesn't glob the brackets
mkdir -p ~/llama-cpp && cd ~/llama-cpp
# Initialize Ollama with Intel GPU support via the IPEX-LLM helper
init-ollama
```

Intel's guide uses a Python venv, installs `ipex-llm[cpp]`, then initializes an Ollama build configured for Intel GPUs.

Optional sanity check: confirm the container can see the Intel GPU compute stack by checking OpenCL exposure; many users validate with `clinfo` after installing Intel's OpenCL runtime.
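A minimal version of that check, assuming Ubuntu's stock `intel-opencl-icd` and `clinfo` packages (Intel's compute-runtime apt repo works too):

```bash
# Install Intel's OpenCL ICD and the clinfo diagnostic tool
sudo apt install -y intel-opencl-icd clinfo
# List detected OpenCL platforms/devices; the iGPU should show up as an Intel graphics device
clinfo | grep -iE "platform name|device name"
# If no Intel GPU appears, re-check the /dev/dri mapping from Step 2
```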
## Step 4: Run Ollama accelerated on the Intel iGPU

Set the recommended environment variables for full layer offload and Level Zero behavior, then start the service:

```bash
# From inside the LXC (activate the venv if needed)
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
# If oneAPI setvars is installed system-wide, source it:
source /opt/intel/oneapi/setvars.sh || true
# For certain kernels/GPUs this can help:
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# Optional: listen on all interfaces if exposing to the LAN
# export OLLAMA_HOST=0.0.0.0
./ollama serve  # init-ollama links the binary into the current directory
```

`OLLAMA_NUM_GPU=999` forces all layers that can run on the GPU to offload, while the Level Zero variables improve device telemetry and command submission for Intel GPUs.

In a second shell, pull and run a model, for example:

```bash
./ollama run llama3.1:8b
```

The IPEX-LLM integration steers the underlying llama.cpp execution toward Intel's SYCL/Level Zero path when possible.

## Option B: OpenVINO backend for Ollama (higher-throughput path)

Intel's OpenVINO integration can accelerate inference on Intel CPU/iGPU/NPU and offers a dedicated backend for Ollama via OpenVINO GenAI, which can outperform the generic SYCL path in many cases.

High-level flow in the LXC or VM:

1) Download and initialize the OpenVINO GenAI runtime; set `GODEBUG=cgocheck=0` for the Ollama executable using this backend.
2) Obtain an OpenVINO IR model (e.g., quantized DeepSeek-R1-Distill-Qwen-7B int4), then package it.
3) Write a Modelfile declaring ModelType "OpenVINO" and InferDevice "GPU", then create the Ollama model image (see the sketch after the commands below).

Practical commands (example flow shown by the OpenVINO team and contributors):

```bash
# 1) Prepare the OpenVINO GenAI runtime (env example)
export GODEBUG=cgocheck=0
# Source the OpenVINO/GenAI setup if provided by your runtime package
# source setupvars.sh

# 2) Download an OpenVINO IR model (example uses ModelScope tools)
pip install modelscope
modelscope download --model zhaohb/DeepSeek-R1-Distill-Qwen-7B-int4-ov --local_dir ./DeepSeek-R1-Distill-Qwen-7B-int4-ov
tar -zcvf DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz DeepSeek-R1-Distill-Qwen-7B-int4-ov

# 3) Modelfile (OpenVINO backend); adjust the stop tokens to your model's chat template
cat > Modelfile << 'EOF'
FROM DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz
ModelType "OpenVINO"
InferDevice "GPU"
PARAMETER stop "<|end_of_sentence|>"
EOF
```
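To finish, register the packaged model with Ollama and run it. A minimal sketch, assuming the OpenVINO-enabled Ollama binary keeps stock Ollama's `create`/`run` CLI; the tag `deepseek-r1-ov` is just an illustrative name:

```bash
# Create the Ollama model image from the Modelfile, then chat with it
./ollama create deepseek-r1-ov -f Modelfile
./ollama run deepseek-r1-ov
```

With InferDevice set to "GPU" in the Modelfile, OpenVINO GenAI routes inference to the Intel iGPU rather than the CPU.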