
Llama Docs & Prompt Guide

Everything you need for Meta's open-source Llama models. Chat templates, system prompts, local deployment with Ollama, fine-tuning with LoRA, and production serving.

Official Meta docs →
Content sourced from official Meta documentation
1. Why Llama is different

Unlike every other provider on this list, Llama is open-source. That means this guide covers not just prompting but also setup, deployment, and fine-tuning. You own the model. You can run it on your hardware, customize it for your domain, and never pay per-token. Access Llama through cloud providers (AWS, Azure, Google Cloud), hosted endpoints (Together AI, Groq, Fireworks), or locally on your own machine.

💡If you're just experimenting, use a hosted endpoint like Together AI or Groq. They offer free tiers and the setup is identical to using OpenAI's API. Go local when you need privacy or cost control at scale.
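Together AI and Groq both expose OpenAI-compatible chat APIs, so switching providers is mostly a matter of changing the base URL and API key. A minimal sketch of the request payload (the model name is illustrative; check each provider's model list):

```python
def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.7) -> dict:
    """Build an OpenAI-compatible chat completion payload.

    The same payload shape works against Together AI, Groq, or a
    self-hosted server -- only the endpoint URL and API key change.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }

# Model name is an example -- verify against your provider's docs.
payload = build_chat_request(
    model="llama-3.1-8b-instant",
    system="You are a helpful coding assistant.",
    user="Write a Python function to check if a number is prime.",
)
```

Because hosted endpoints mirror OpenAI's schema, any OpenAI client library can send this payload by pointing its base URL at the provider.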
2. Chat templates: get this right first

This is the number one reason people get bad results from local Llama models. Llama uses specific chat templates with special tokens that mark system messages, user turns, and assistant turns. If the template is wrong, the model sees garbage and outputs garbage. Llama 3 uses the <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> tokens; Llama 2 and Llama 4 use different formats. Always verify you're using the correct template for your exact model version.

💡If you're using Ollama, it handles templates automatically. If you're using llama.cpp or vLLM, check the model card on Hugging Face for the exact template format.
Llama 3 chat template
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful coding assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Write a Python function to check if a number is prime.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Wrong template = bad output. This is the Llama 3 format; Llama 2 and Llama 4 use different formats. Always check your model version.
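In practice the tokenizer applies this template for you (e.g. Hugging Face's apply_chat_template), but it helps to see the rendered string. A hand-rolled sketch of the Llama 3-family format shown above, for illustration only:

```python
def format_llama3_prompt(messages: list[dict]) -> str:
    """Render chat messages into the Llama 3 template string.

    For real use, prefer the tokenizer's apply_chat_template(),
    which always matches your exact model version.
    """
    out = "<|begin_of_text|>"
    for msg in messages:
        out += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        out += f"{msg['content']}<|eot_id|>"
    # Leave the assistant turn open so the model generates the reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user",
     "content": "Write a Python function to check if a number is prime."},
])
```

Note that the prompt ends with an open assistant header: the model's job is to continue from there, so the template must not close that final turn.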
3. System prompts: keep them short

Llama respects system prompts but handles them differently than cloud API models. The key difference: keep system prompts concise. Llama models perform noticeably better with shorter, focused instructions than with long system prompts. Define the role, specify output format, set behavioral constraints, and stop. Don't add paragraphs of context that could go in the user message instead.

💡A good Llama system prompt is 2-4 sentences. If yours is longer than a paragraph, move the extra context to the user message.
System prompt length
You are a senior financial analyst specializing in tech/SaaS. Provide concise analysis with bullet points. Include risk assessment and cite reasoning.
A system prompt this short gives Llama clear behavioral guidance without diluting attention. Long system prompts cause Llama to lose focus on later instructions.
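In code, "move the extra context to the user message" just means the system prompt stays fixed and short while background material rides along in the user turn. A sketch using the example role above (the context string is a placeholder):

```python
SYSTEM = (
    "You are a senior financial analyst specializing in tech/SaaS. "
    "Provide concise analysis with bullet points. "
    "Include risk assessment and cite reasoning."
)

def build_messages(question: str, context: str = "") -> list[dict]:
    """Keep the system prompt short; long background goes in the user turn."""
    user = (f"Context:\n{context}\n\nQuestion: {question}"
            if context else question)
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
    ]

msgs = build_messages("Assess Q3 margin trends.",
                      context="(10-K excerpt would go here)")
```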
4. Running Llama locally

Multiple options depending on your needs: Ollama for the easiest setup (one command install, handles everything), llama.cpp for CPU inference on machines without GPUs, vLLM for production serving with high throughput, and TGI (Text Generation Inference) for containerized deployments. Quantization lets you run larger models on smaller hardware: 4-bit (Q4) cuts VRAM the most at some quality cost, while 8-bit (Q8) preserves more quality.

💡Start with Ollama. It's one command to install and one command to run any model. When you need production performance, graduate to vLLM.
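A rough rule of thumb for whether a model fits your GPU: weights alone take about (parameters × bits) / 8 bytes. This sketch ignores the KV cache and runtime overhead, which add more on top:

```python
def approx_weight_gb(params_billion: float, bits: int) -> float:
    """Approximate VRAM for model weights alone (no KV cache or overhead)."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal GB

# An 8B model: ~16 GB at FP16, ~8 GB at Q8, ~4 GB at Q4.
fp16 = approx_weight_gb(8, 16)
q8 = approx_weight_gb(8, 8)
q4 = approx_weight_gb(8, 4)
```

This is why Q4 quantization is the default for consumer GPUs: it brings an 8B model under the 8 GB VRAM of common cards, with room left for the KV cache.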
5. Fine-tuning with LoRA

This is where open-source models shine. Fine-tune Llama for your specific domain with LoRA (Low-Rank Adaptation) or QLoRA (quantized LoRA for less VRAM). Start with a small dataset (hundreds of examples), not thousands. Use LoRA for parameter-efficient training that takes hours instead of days. Evaluate on held-out test sets to confirm improvement. The result: a model that speaks your domain's language natively.

💡QLoRA lets you fine-tune a 70B parameter model on a single GPU. Start there if you have limited hardware. The quality difference from full LoRA is minimal for most use cases.
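LoRA's parameter efficiency is easy to see in the numbers: instead of updating a full d × d weight matrix, it trains two low-rank factors B (d × r) and A (r × d) and adds their scaled product as a delta, W' = W + (alpha/r)·BA. A numpy sketch with typical shapes (d=4096, r=8, alpha=16 are illustrative defaults, not values from this guide):

```python
import numpy as np

d, r, alpha = 4096, 8, 16            # hidden size, LoRA rank, scaling

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init: delta starts at 0

W_adapted = W + (alpha / r) * (B @ A)  # effective weight at inference

full_params = W.size                 # parameters in the full matrix
lora_params = A.size + B.size        # trainable LoRA parameters
```

With these shapes, LoRA trains 65,536 parameters against the matrix's 16.7 million, under half a percent, which is why fine-tuning takes hours instead of days.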
6. Production deployment

Three paths to production: Cloud (AWS SageMaker, Azure ML, GCP Vertex AI) for managed infrastructure, Hosted endpoints (Together AI, Fireworks, Groq) for API simplicity at lower cost than OpenAI, or Self-hosted (vLLM, TGI) for full control. Trade-offs: cloud is easiest, hosted is cheapest for moderate traffic, self-hosted is cheapest at scale and gives you complete data privacy.

💡For production, batch requests when possible and consider speculative decoding for faster inference. Groq's LPU hardware gives the fastest inference speeds if latency matters most.
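On the client side, "batch requests" just means grouping prompts before submitting them so the server can process each group in one pass. A minimal pure-Python sketch (a real setup would hand each batch to vLLM's generate call or a provider's batch API):

```python
from typing import Iterable, Iterator

def batched(prompts: Iterable[str], batch_size: int = 8) -> Iterator[list[str]]:
    """Group prompts into fixed-size batches for one-shot submission."""
    batch: list[str] = []
    for p in prompts:
        batch.append(p)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(batched([f"prompt {i}" for i in range(10)], batch_size=4))
```

Batch size is a throughput/latency trade-off: larger batches keep the GPU busier, but the slowest prompt in a batch sets its completion time.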

Key topics covered

Llama setup
Chat templates
System prompts
Fine-tuning
RAG patterns
Deployment options