Llama Docs & Prompt Guide
Everything you need for Meta's open-weight Llama models. Chat templates, system prompts, local deployment with Ollama, fine-tuning with LoRA, and production serving.
Official Meta docs →
Why Llama is different
Unlike the other providers in this guide, Llama is open-weight: you can download the model itself. That means this guide covers not just prompting but also setup, deployment, and fine-tuning. You own the weights. You can run them on your own hardware, customize them for your domain, and never pay per token. Access Llama through cloud providers (AWS, Azure, Google Cloud), hosted endpoints (Together AI, Groq, Fireworks), or run it locally on your own machine.
Chat templates: get this right first
This is the number one reason people get bad results from local Llama models. Llama uses version-specific chat templates with special tokens that mark system messages, user turns, and assistant turns. If the template is wrong, the model sees garbage and outputs garbage. Llama 3, for example, uses the <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> tokens, while Llama 4 replaced the header tokens with <|header_start|> and <|header_end|>. Always verify you're using the correct template for your exact model version.
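To make the template concrete, here is the Llama 3 format assembled by hand, so you can see exactly what the model receives. The helper name `build_llama3_prompt` is ours for illustration; in practice, prefer your framework's built-in template handling (for example, `tokenizer.apply_chat_template` in Hugging Face Transformers) rather than string-building yourself.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Format one system + user turn in the Llama 3 chat template."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # End with an open assistant header so the model writes the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You are a concise assistant.",
    "Define LoRA in one sentence.",
)
```

If the tokens or the trailing assistant header are missing, the model falls back to plain text completion, which is the classic "works in the cloud, rambles locally" failure mode.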
System prompts: keep them short
Llama respects system prompts but handles them differently than cloud API models. The key difference: keep system prompts concise. Llama models perform noticeably better with shorter, focused instructions than with long system prompts. Define the role, specify output format, set behavioral constraints, and stop. Don't add paragraphs of context that could go in the user message instead.
Running Llama locally
Multiple options depending on your needs: Ollama for the easiest setup (a one-command install that handles model downloads and templates), llama.cpp for CPU inference on machines without GPUs, vLLM for production serving with high throughput, and TGI (Text Generation Inference) for containerized deployments. Quantization lets you run larger models on smaller hardware: 4-bit (Q4) roughly quarters the fp16 memory footprint at a modest quality cost, while 8-bit (Q8) halves it and preserves nearly full quality.
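The quantization arithmetic is simple enough to sketch: weight memory is roughly parameter count times bits per weight. The helper below is our own back-of-the-envelope estimate, and it deliberately ignores the KV cache, activations, and runtime overhead, so treat the numbers as a lower bound when sizing hardware.

```python
def weight_vram_gb(params_billions: float, bits: int) -> float:
    """Rough VRAM for the weights alone: params * (bits / 8) bytes.

    Ignores KV cache, activations, and runtime overhead, so the real
    requirement is always somewhat higher.
    """
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal GB

# An 8B-parameter model at different precisions:
fp16 = weight_vram_gb(8, 16)  # 16.0 GB
q8 = weight_vram_gb(8, 8)     # 8.0 GB
q4 = weight_vram_gb(8, 4)     # 4.0 GB
```

This is why an 8B model at Q4 fits comfortably on a consumer GPU with 8 GB of VRAM, while fp16 would not.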
Fine-tuning with LoRA
This is where open-source models shine. Fine-tune Llama for your specific domain with LoRA (Low-Rank Adaptation) or QLoRA (quantized LoRA for less VRAM). Start with a small dataset (hundreds of examples), not thousands. Use LoRA for parameter-efficient training that takes hours instead of days. Evaluate on held-out test sets to confirm improvement. The result: a model that speaks your domain's language natively.
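The core LoRA idea fits in a few lines: the frozen base weight W is left untouched, and a trainable low-rank update scaled by alpha/r is added on top. The toy below is a pure-Python illustration of the math, not a training loop, and the function names and shapes are ours (conventions for which factor is A vs. B vary between papers and libraries). In practice you would use a library such as Hugging Face PEFT.

```python
def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    """y = x @ W + (alpha / r) * (x @ A @ B).

    W is the frozen base weight; only A (d_in x r) and B (r x d_out)
    would be trained, which is why LoRA is parameter-efficient.
    """
    scale = alpha / r
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)  # the rank-r path
    return [[b + scale * u for b, u in zip(brow, urow)]
            for brow, urow in zip(base, update)]

# Toy shapes: x is 1x4, W is 4x4 (identity), A is 4x2 (r=2), B is 2x4.
x = [[1.0, 2.0, 3.0, 4.0]]
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1] * 2 for _ in range(4)]
B_zero = [[0.0] * 4 for _ in range(2)]  # B starts at zero in LoRA

# With B zero-initialized, the adapter is a no-op at step 0, so
# fine-tuning starts exactly from the base model's behavior.
y0 = lora_forward(x, W, A, B_zero)
```

The zero-initialized B factor is the reason LoRA training is stable from the first step: the adapted model is initially identical to the base model, and the update grows only as training moves B away from zero.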
Production deployment
Three paths to production: Cloud (AWS SageMaker, Azure ML, GCP Vertex AI) for managed infrastructure, Hosted endpoints (Together AI, Fireworks, Groq) for API simplicity at lower cost than OpenAI, or Self-hosted (vLLM, TGI) for full control. Trade-offs: cloud is easiest, hosted is cheapest for moderate traffic, self-hosted is cheapest at scale and gives you complete data privacy.