llama
npx machina-cli add skill G1Joshi/Agent-Skills/llama --openclawLlama
Meta Llama is the king of open-weights models. Llama 4 (2025) pushes past 405B parameters, rivaling closed models like GPT-5.
When to Use
- Privacy: Run it on your own VPC (AWS Bedrock, Azure, or self-hosted).
- Fine-Tuning: It is the default base model for fine-tuning on domain data.
- Cost: Inference on Groq/Together AI is significantly cheaper than GPT.
Core Concepts
Models
- 405B: Frontier intelligence. Requires massive GPU clusters (or API).
- 70B: The workhorse. Smart enough for most tasks.
- 8B: Runs on a laptop (MacBook M3).
Quantization
Running models at 4-bit or 8-bit precision to fit in VRAM with minimal quality loss (GGUF, EXL2).
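Why this matters in practice: weight memory is roughly parameters × bits-per-weight ÷ 8, plus overhead for KV cache and activations. A back-of-envelope sketch (real GGUF/EXL2 files mix quantization levels per tensor, so treat these numbers as approximations):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 8B: ~16 GB at fp16, ~4.5 GB at a typical ~4.5-bit GGUF quant
print(f"8B  fp16:   {weight_memory_gb(8, 16):.1f} GB")    # 16.0 GB
print(f"8B  4-bit:  {weight_memory_gb(8, 4.5):.1f} GB")   # 4.5 GB
print(f"70B 4-bit:  {weight_memory_gb(70, 4.5):.1f} GB")  # 39.4 GB
```

This is why a quantized 8B fits a laptop while 70B needs a workstation-class GPU even at 4-bit.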
Llama Stack
Standardized tooling for building agentic apps on Llama.
Best Practices (2025)
Do:
- Use via API: Groq (LPU) runs Llama at extreme speed (>1000 tok/s).
- Fine-Tune 8B: For specific tasks (classification, SQL generation), a fine-tuned 8B beats a generic 70B.
Don't:
- Don't self-host 405B: Unless you have 8xH100s. Use an API provider.
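Groq exposes an OpenAI-compatible chat-completions endpoint, so switching from GPT is mostly a base-URL change. A minimal request-body sketch using only the stdlib (the model id `llama-3.1-8b-instant` and the endpoint path are assumptions; check Groq's docs for current values):

```python
import json

# OpenAI-compatible chat-completions body. POST it to
# https://api.groq.com/openai/v1/chat/completions with an
# "Authorization: Bearer $GROQ_API_KEY" header.
payload = {
    "model": "llama-3.1-8b-instant",  # assumed model id; verify against Groq's model list
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize quantization in one sentence."},
    ],
    "temperature": 0.2,
    "max_tokens": 128,
}
body = json.dumps(payload)
print(body[:60])
```

Because the wire format matches OpenAI's, the official `openai` Python client also works by pointing `base_url` at Groq.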
References
Source
git clone https://github.com/G1Joshi/Agent-Skills.git
Skill file: skills/ai-ml/llama/SKILL.md
Overview
Meta Llama is an open-weights LLM family designed for local deployment and privacy. It includes models like 405B, 70B, and 8B, enabling on-premises fine-tuning and cost-effective inference, with quantization and a unified Llama Stack for building agentic apps.
How This Skill Works
Llama's models come in sizes 405B, 70B, and 8B. You can apply 4-bit or 8-bit quantization (GGUF, EXL2) to fit VRAM, and use standardized tooling via the Llama Stack to build agentic apps. Fine-tuning 8B is feasible for domain tasks, while large 405B models are typically API-hosted due to hardware needs.
When to Use It
- Privacy-focused deployments on your own VPC or self-hosted hardware
- Fine-tuning on domain data to customize behavior (especially with 8B)
- Cost-conscious inference using Groq/Together AI
- Laptop or on-device usage with 8B for local experimentation
- Choosing the right model size (8B, 70B, or 405B) to balance resources and capability
Quick Start
- Step 1: Pick a model size (8B for laptop, 70B for general use, 405B for top performance) and plan hardware.
- Step 2: Apply 4-bit or 8-bit quantization (GGUF, EXL2) to fit VRAM.
- Step 3: Deploy via the Llama Stack on your chosen path (self-hosted or API) depending on privacy and cost.
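Step 1's size choice can be sketched as a simple picker that pairs the memory math above with a VRAM budget. The 20% headroom and 4.5 bits-per-weight defaults are rules of thumb, not official guidance:

```python
def pick_llama_size(vram_gb: float, bits_per_weight: float = 4.5) -> str:
    """Pick the largest Llama size whose quantized weights fit the VRAM budget,
    leaving ~20% headroom for KV cache and activations (rule of thumb)."""
    budget = vram_gb * 0.8
    for params_b, name in [(405, "405B"), (70, "70B"), (8, "8B")]:
        if params_b * bits_per_weight / 8 <= budget:
            return name
    return "API (no local fit)"

print(pick_llama_size(16))   # laptop-class GPU -> 8B
print(pick_llama_size(80))   # single datacenter GPU -> 70B
print(pick_llama_size(640))  # 8x80GB cluster -> 405B
```

Anything below the 8B threshold falls through to an API provider, matching the "don't self-host 405B" advice above.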
Best Practices
- Use via API: Groq (LPU) runs Llama at extreme speed (>1000 tok/s).
- Fine-tune 8B: tailor it to specific tasks (classification, SQL generation) to outperform a generic 70B.
- Quantize to 4-bit or 8-bit (GGUF, EXL2) to fit VRAM with minimal quality loss.
- Leverage the Llama Stack for standardized tooling to build agentic apps on Llama.
- Don't self-host 405B unless you have 8xH100s; use an API provider.
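One reason fine-tuning 8B is cheap: with LoRA-style adapters you train small low-rank matrices instead of all weights. For a weight matrix of shape d_out × d_in, a rank-r adapter pair adds r·(d_in + d_out) trainable parameters. A back-of-envelope sketch (the layer shapes below approximate Llama-3-8B's attention projections and are assumptions, simplified by ignoring grouped-query attention's smaller k/v projections):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params for one LoRA adapter pair: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

# Llama-3-8B-like attention: hidden size 4096, 32 layers, adapting q/k/v/o projections.
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)  # q, k, v, o
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable params")  # 16.8M trainable params
```

Roughly 17M trainable parameters against 8B total, which is why an 8B LoRA run fits on a single consumer GPU.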
Example Use Cases
- Privacy-preserving chat assistant deployed in a company’s private VPC using a fine-tuned 70B model.
- Domain-adapted support bot trained on internal docs via an 8B fine-tuned model.
- Cost-optimized QA bot running on Groq/Together AI with a 4-bit quantized Llama model.
- On-device personal assistant demo running 8B Llama on a laptop or MacBook M3.
- Research team comparing a locally quantized 8B versus API-hosted 405B for a data-to-text task.