QVAC · Local AI Lens

Local AI inference engine

llama.cpp

The C/C++ foundation for local LLM inference — what most local AI tools build on.

Axes (0–100)

Local control / custody94
Open stack (models & tooling)96
Regulatory posture (curated)35
Interoperability78

Last reviewed: 2026-04-10

Facts (curated)

Focus: Minimal-dependency C/C++ LLM inference engine. llama-cli for chat, llama-server for OpenAI-compatible API.
Ecosystem: The engine underneath Ollama, LM Studio, LocalAI, and many others. GGUF is the de facto local model format.
Backing: Created by Georgi Gerganov. The ggml team joined Hugging Face in early 2026. MIT licensed.

Pros

Widest hardware support (CUDA, Vulkan, Metal, CPU) and best raw performance. The foundation everything else builds on.
- llama.cpp on GitHub

Cons / risks

Requires building from source or manual setup for full control. Higher barrier to entry than wrapper tools like Ollama.
- llama.cpp README

Related links

llama.cpp — LLM inference in C/C++
GitHub · 2026-04-09