llama-cpp
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/llama-cpp --openclaw
llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
Quick start
Installation
# macOS/Linux
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
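Newer llama.cpp releases use CMake as the primary build system, and the acceleration switches there are prefixed GGML_ rather than LLAMA_. A rough CUDA equivalent looks like this (a sketch; flag names can differ slightly between versions):
# CMake build with CUDA; GGML_METAL=ON / GGML_HIP=ON are the analogous Metal and ROCm switches
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j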
Download model
# Download from HuggingFace (GGUF format)
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
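The converter normally writes a full-precision (F16) GGUF first; a quantized variant can then be produced with the bundled llama-quantize tool. A sketch, with illustrative output file names:
# Convert HF weights to an F16 GGUF, then re-quantize to Q4_K_M
python convert_hf_to_gguf.py models/llama-2-7b-chat/ --outfile models/llama-2-7b-chat.f16.gguf
./llama-quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M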
Run inference
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
Server mode
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
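Besides the OpenAI-compatible route, llama-server also exposes a native /completion endpoint; a minimal request looks roughly like this (a sketch, consult the server docs for the full parameter list):
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing", "n_predict": 128, "temperature": 0.7}'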
Quantization formats
GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
Choosing quantization
# General use (balanced)
Q4_K_M # 4-bit, medium quality
# Maximum speed (more degradation)
Q2_K or Q3_K_M
# Maximum quality (slower)
Q6_K or Q8_0
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
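To check how much a given quantization actually degrades your model, the bundled perplexity tool can be run over a small evaluation text and compared across formats (a sketch; the test file is whatever corpus you have on hand):
# Lower perplexity = better quality; run the same file against each quantization
./llama-perplexity -m models/llama-2-7b-chat.Q4_K_M.gguf -f wiki.test.raw
./llama-perplexity -m models/llama-2-7b-chat.Q6_K.gguf -f wiki.test.raw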
Hardware acceleration
Apple Silicon (Metal)
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
NVIDIA GPUs (CUDA)
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
AMD GPUs (ROCm)
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
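On consumer AMD cards that ROCm does not officially list, a common workaround is to override the GFX target so the HIP backend loads (an assumption; the version string depends on your GPU generation):
# Example for an RDNA3 card; adjust the override to match your GPU
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-cli -m model.gguf -ngl 999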
Common patterns
Batch processing
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
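For sustained batch workloads it is often simpler to run the server with several parallel slots and continuous batching, then submit requests concurrently from clients (a sketch; the slot count and context size are assumptions to tune):
# 4 parallel slots sharing an 8K context, with continuous batching enabled
./llama-server -m model.gguf -np 4 --cont-batching -c 8192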
Constrained generation
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
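A grammar can also be derived from a JSON schema using the helper script shipped in the llama.cpp repo (a sketch; the schema file name and script path are illustrative and may vary by version):
# Generate a GBNF grammar from a JSON schema, then constrain generation with it
python examples/json_schema_to_grammar.py person.schema.json > person.gbnf
./llama-cli -m model.gguf -p "Generate a person: " --grammar-file person.gbnf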
Context size
# Increase context (default 512)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|---|---|---|---|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|---|---|---|---|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU
Resources
- GitHub: https://github.com/ggerganov/llama.cpp
- Models: https://huggingface.co/models?library=gguf
- Discord: https://discord.gg/llama-cpp
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/12-inference-serving/llama-cpp/SKILL.md
Overview
llama.cpp provides pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware. It shines on Apple Silicon Macs (M1/M2/M3), AMD/Intel GPUs, and edge devices, and it supports GGUF quantization (1.5-8 bit) to cut memory use and deliver 4-10× speedups versus PyTorch on CPU.
How This Skill Works
Built as a pure C/C++ inference engine, llama.cpp runs models on CPU or non-NVIDIA GPUs without CUDA. It supports GGUF quantization to shrink memory footprints and accelerate performance, and is controllable via build flags (LLAMA_METAL for Apple Silicon, LLAMA_CUDA for NVIDIA, LLAMA_HIP for AMD). It also provides a server mode (llama-server) and a lightweight Python binding (llama-cpp-python) for integration.
When to Use It
- Running on CPU-only machines (no CUDA)
- Deploying on Apple Silicon (M1/M2/M3)
- Using AMD or Intel GPUs without CUDA
- Edge deployment on Raspberry Pi or embedded systems
- Need a simple deployment without Docker or Python
Quick Start
- Step 1: Install or build llama.cpp (brew install llama.cpp on macOS/Linux or clone the repo and run make; enable Metal/CUDA/ROCm as needed)
- Step 2: Download or convert a GGUF model (huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf or run convert_hf_to_gguf.py)
- Step 3: Run a quick test (llama-cli -m models/llama-2-7b-chat.Q4_K_M.gguf -p 'Explain quantum computing' -n 256) or start the server (llama-server ...)
Best Practices
- Start with GGUF Q4_K_M (4-bit, balanced) as the default to balance memory, speed, and quality
- Choose quantization bits to trade memory, speed, and quality (Q2_K, Q4_K_M, Q6_K, Q8_0, etc.)
- Build with the correct hardware flags: LLAMA_METAL for Apple Silicon; LLAMA_CUDA for NVIDIA; LLAMA_HIP for AMD
- Use llama-server for an OpenAI-compatible API endpoint to service multiple clients
- Benchmark different model sizes and quantizations to fit RAM and latency constraints
Example Use Cases
- Running Llama-2-7B-GGUF on a MacBook Pro with Apple Silicon acceleration
- CPU-only edge deployment on a Raspberry Pi using GGUF 4-bit quantization
- Inference on AMD/Intel GPUs without CUDA using ROCm/hip offload
- Serving a local chat API via llama-server for internal tools
- Converting HuggingFace models to GGUF and validating performance on CPU
Related Skills
quantizing-models-bitsandbytes
Orchestra-Research/AI-Research-SKILLs
Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
sglang
Orchestra-Research/AI-Research-SKILLs
Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.
awq-quantization
Orchestra-Research/AI-Research-SKILLs
Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
gptq
Orchestra-Research/AI-Research-SKILLs
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
hqq-quantization
Orchestra-Research/AI-Research-SKILLs
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
tensorrt-llm
Orchestra-Research/AI-Research-SKILLs
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.