
Llama

Llama is a family of AI models developed by Meta AI, representing the company's commitment to democratizing artificial intelligence through open weights distribution. Unlike closed models from OpenAI, Google, and Anthropic, Llama models are downloadable with publicly available weights, enabling full customization, local deployment, and data sovereignty. The family philosophy centers on building an industry-standard ecosystem rather than proprietary lock-in, inspired by Linux's success against closed Unix variants.

The family currently includes foundation models (Llama 3.3), multimodal models (Llama 4 Scout and Maverick, Llama 3.2 Vision), and edge-optimized models (Llama 3.2 1B/3B), with strengths in cost efficiency, portability, and transparency. Llama 4 introduces a Mixture-of-Experts (MoE) architecture with native multimodality, a first for the Llama family among open-weight releases. Models are trained on 15-40 trillion tokens and are optimized for production deployment through quantization support (from BF16 down to FP8/INT4).

Key differentiators include true model ownership (download, modify, and deploy without vendor lock-in), among the lowest costs per token in the industry, full data security for sensitive industries, no API rate limits when self-hosted, and extensive hardware support across NVIDIA, AMD, Qualcomm, MediaTek, and Arm. The ecosystem has achieved 300M+ downloads with 85,000+ derivatives on HuggingFace. Llama is the right choice when you need data sovereignty, on-premises deployment, model customization on proprietary data, transparency and auditability, or want to avoid API pricing volatility.


Platform & Access

Platform Name: Llama API + Llama Stack

Official URLs: llama.com (main site, downloads, documentation), github.com/meta-llama (reference code and Llama Stack), huggingface.co/meta-llama (model weights)

What the Platform Offers:

Llama API (limited free preview as of April 2025) provides one-click API key creation, interactive playgrounds, lightweight SDKs (Python, TypeScript), OpenAI SDK compatibility, fine-tuning tools, an evaluation suite, and complete model management with a privacy guarantee (customer data is not used to train Meta's models).
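
Because of that OpenAI SDK compatibility, existing OpenAI client code can often be repointed by swapping the base URL. A minimal sketch; the endpoint URL and model identifier below are assumptions, so check the Llama API documentation for current values:

```python
# Hypothetical sketch: calling the Llama API through the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LLAMA_API_KEY",                 # key from the Llama API dashboard
    base_url="https://api.llama.com/compat/v1/",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the Llama license in one sentence."}],
)
print(response.choices[0].message.content)
```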

Llama Stack (production-ready open source) offers a standardized infrastructure framework for building generative AI applications, with a unified API layer for Inference, RAG, Agents, Tools, Safety, and Evaluations. It features a plugin architecture supporting local, on-premises, cloud, and mobile environments, with pre-configured distributions and SDKs for Python, TypeScript, Swift, and Kotlin.
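
A minimal sketch of the unified Inference API using the llama-stack-client Python SDK, assuming a Llama Stack distribution is already running locally (the port and model ID here are illustrative):

```python
# Sketch: chat completion against a locally running Llama Stack server.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed local server

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",  # any model the distribution serves
    messages=[{"role": "user", "content": "What does the unified Inference API cover?"}],
)
print(response.completion_message.content)
```

The same client call works unchanged whether the provider behind the server is local (e.g., Ollama), on-premises, or a cloud backend, which is the point of the plugin architecture.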

Access Models:

Three primary methods:

  1. Open weights download from llama.com or HuggingFace (requires license acceptance; self-host on your own infrastructure; see the sketch after this list),
  2. Llama API hosted by Meta (limited free preview, waitlist required),
  3. Third-party cloud providers including AWS Bedrock, Google Cloud Vertex AI, Azure, Together AI, Fireworks, Groq, Cerebras, Replicate, and OpenRouter.
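
For option 1, a minimal self-hosting sketch with Hugging Face transformers, assuming you have accepted the Llama Community License on the model page and authenticated with a Hugging Face token:

```python
# Sketch: running Llama 3.2 3B Instruct locally via transformers.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # gated repo; license acceptance required
    torch_dtype=torch.bfloat16,                # BF16 weights; quantize further if needed
    device_map="auto",                         # place layers on available GPU(s)/CPU
)

messages = [{"role": "user", "content": "Explain data sovereignty in two sentences."}]
result = pipe(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```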

Pricing Model:

The Llama API is currently free during preview (production pricing TBA). Third-party providers charge token-based pricing, typically $0.10-$0.90 per million tokens. Self-hosting has zero API costs after the initial hardware investment.
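
To make the self-hosting trade-off concrete, here is a back-of-the-envelope break-even calculation; every figure in it (hardware price, token price, monthly volume) is an illustrative assumption, not a quote:

```python
# Break-even sketch: self-hosted hardware vs. per-token API pricing.
hardware_cost_usd = 4000          # assumed: a dual-GPU workstation
api_price_per_mtok = 0.60         # assumed: mid-range third-party price per 1M tokens
tokens_per_month = 500_000_000    # assumed: 500M tokens/month of sustained usage

api_monthly_cost = tokens_per_month / 1_000_000 * api_price_per_mtok
breakeven_months = hardware_cost_usd / api_monthly_cost
print(f"API cost: ${api_monthly_cost:,.0f}/month; "
      f"hardware pays off in {breakeven_months:.1f} months (ignoring power and ops)")
```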

License: Llama Community License (custom commercial license, royalty-free for companies with <700M monthly active users)


Foundation Models

Llama 3.3 70B Instruct

Primary Use Cases: advanced reasoning, multilingual dialogue, model distillation, and cost-effective flagship-level performance (see the comparison table below).

Agentic Capabilities: native tool/function calling, with the strongest agentic rating in the family per the comparison table; a minimal sketch follows.
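
A sketch of that tool-calling strength through any OpenAI-compatible provider; the base URL, model ID, and the get_weather tool are all assumptions for illustration:

```python
# Hypothetical sketch: function calling with Llama 3.3 70B.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.your-provider.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool defined by your application
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed provider-side model ID
    messages=[{"role": "user", "content": "What's the weather in Warsaw?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured tool call, if the model emits one
```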


Multimodal / Vision Models

Llama 4 Maverick (17B-128E)

Primary Use Cases: general-purpose multimodal assistants, visual document analysis, and enterprise AI (see the comparison table below).

Agentic Capabilities: tool calling combined with native image understanding, rated alongside Llama 3.3 70B as the strongest agentic option in the family; a usage sketch follows.
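
A sketch of multimodal input via an OpenAI-compatible provider; the endpoint, model ID, and image URL are illustrative assumptions:

```python
# Hypothetical sketch: image + text question to Llama 4 Maverick.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.your-provider.com/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What totals appear in this invoice?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```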


Llama 3.2 90B Vision Instruct

Primary Use Cases: enterprise document understanding, medical imaging, and financial analysis (see the comparison table below).

Agentic Capabilities: strong agentic rating; suited to vision-grounded workflows such as extracting structured data from documents for downstream tools.


Llama 3.2 11B Vision Instruct

Primary Use Cases: cost-effective vision tasks, accessibility tools, and content moderation (see the comparison table below).

Agentic Capabilities: moderate agentic rating; appropriate for lighter vision-in-the-loop pipelines where cost matters more than peak accuracy.


Lightweight / Edge Models

Llama 4 Scout (17B-16E)

Primary Use Cases: long-context workloads such as whole-codebase analysis, multi-document summarization, and video processing, plus deployment where a single GPU must suffice (see the comparison table below).

Agentic Capabilities: tool calling with a 10M-token context window, useful for agents that must hold large working sets in context.


Llama 3.2 3B Instruct

Primary Use Cases: mobile AI, on-device inference, and privacy-focused applications (see the comparison table below).

Agentic Capabilities: supports tool calling despite its size; suitable for simple on-device agents.


Llama 3.2 1B Instruct

Primary Use Cases: ultra-lightweight IoT, embedded systems, and battery-constrained devices (see the comparison table below).

Agentic Capabilities: limited; basic instruction following and simple single-tool calls at best.


Preview Models (Not Released)

Llama 4 Behemoth (288B-16E) [IN TRAINING - NOT RELEASED]

Expected Primary Use Cases: teacher model for distilling smaller Llama 4 models, and frontier STEM reasoning (based on Meta's announcements; see Notable Features below).

Agentic Capabilities: Not yet disclosed

Notable Features: serves as the teacher model for codistilling Scout and Maverick via a novel distillation loss function; outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on MATH-500 and GPQA Diamond per Meta's internal testing; not a dedicated reasoning model (it does not use extended chain-of-thought like o1/o3).


Model Comparison Table

| Model | Context | Parameters | Knowledge Cutoff | Agentic Use | Best For |
|---|---|---|---|---|---|
| Llama 3.3 70B | 128K | 70B | Dec 2023 | ⭐⭐⭐⭐⭐ | Advanced reasoning, model distillation, cost-effective flagship performance |
| Llama 4 Maverick | 1M | 17B active / 400B total (MoE) | Aug 2024 | ⭐⭐⭐⭐⭐ | General-purpose multimodal, visual document analysis, enterprise AI |
| Llama 4 Scout | 10M | 17B active / 109B total (MoE) | Aug 2024 | ⭐⭐⭐⭐ | Long-context specialists, codebase analysis, video processing, edge deployment |
| Llama 3.2 90B Vision | 128K | 90B | Dec 2023 | ⭐⭐⭐⭐ | Enterprise document understanding, medical imaging, financial analysis |
| Llama 3.2 11B Vision | 128K | 11B | Dec 2023 | ⭐⭐⭐ | Cost-effective vision tasks, accessibility tools, content moderation |
| Llama 3.2 3B | 128K | 3B | Dec 2023 | ⭐⭐⭐ | Mobile AI, on-device inference, privacy-focused applications |
| Llama 3.2 1B | 128K | 1B | Dec 2023 | ⭐⭐ | Ultra-lightweight IoT, embedded systems, battery-constrained devices |

Key Considerations

Licensing Restrictions: Companies with more than 700 million monthly active users must obtain a special license from Meta. EU restrictions apply to the Llama 3.2 vision models (11B, 90B) and Llama 3.1 405B, which are not available to EU-domiciled individuals or companies. Earlier licenses prohibited using outputs to train other LLMs (except Llama derivatives); Llama 3.1 and later explicitly permit distillation. Deployments must display "Built with Llama" attribution on their website or documentation.

Regional Availability: The Llama 3.2 vision models and Llama 3.1 405B are NOT available in the European Union due to GDPR/AI Act regulatory concerns. The Meta AI app is available in 43 countries, excluding the EU, China, and most of Asia and the Middle East. Text-only models are generally available worldwide for download and self-hosting. Cloud provider availability varies (AWS Bedrock primarily US East/West; Azure limited in EU regions).

Hardware Requirements for Self-Hosting: 1B-3B models require 2-12 GB of VRAM (an RTX 3060 is sufficient). 70B models require 35-140 GB of VRAM (dual RTX 3090/4090 or an A100). The 405B model requires 232-810 GB of VRAM (8x A100 80GB minimum). INT4 quantization reduces memory requirements by roughly 75% relative to BF16. Llama 4 Scout fits on a single NVIDIA H100 GPU with INT4 quantization; Maverick requires a single H100 host (8 GPUs). A rough rule of thumb behind these figures is sketched below.
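
The VRAM figures above follow a simple rule of thumb: weight memory is parameter count times bits per weight divided by 8, with KV cache and activations adding overhead on top. A sketch (estimates only; real usage varies with context length and batch size):

```python
# Rule-of-thumb weight memory: params (billions) x bits / 8 gives GB.
# KV cache and activations add real-world overhead beyond these numbers.
def weight_memory_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8

for name, params in [("Llama 3.2 1B", 1), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
    for bits in (16, 8, 4):  # BF16, FP8, INT4
        print(f"{name} @ {bits:>2}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

For example, the 70B model at 16-bit gives 140 GB and at 4-bit gives 35 GB, matching the 35-140 GB range quoted above.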

Context Length Trade-offs: While the models advertise 128K-10M token windows, effective context is often around 50% of the training length. Performance follows a "U-shaped" curve: retrieval is better at the start and end of the context than in the middle. Cloud providers may cap context well below the model's limit (Azure serverless at 4K-8K despite 128K support), and pushing toward the upper limits reduces inference speed and increases errors.

Model Deprecation Patterns: Major releases arrive roughly annually, and cloud providers deprecate older versions with 2-4 weeks' advance notice (Llama 3.0 has already been superseded by 3.1/3.3). Recommendation: pin explicit version IDs and test before any auto-upgrade, as illustrated below.
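
A small illustration of that pinning advice; the Bedrock-style identifier below is an example and may not match your provider's current catalog:

```python
# Prefer an explicit, versioned model ID over a floating alias so upgrades
# only happen when you choose them. Both IDs are illustrative examples.
FLOATING_MODEL_ID = "llama-70b-instruct-latest"      # may change silently under you
PINNED_MODEL_ID = "meta.llama3-3-70b-instruct-v1:0"  # Bedrock-style pinned version

# Pin in config, bump deliberately, and re-run your eval suite before switching.
MODEL_ID = PINNED_MODEL_ID
```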


Resources

Official Documentation: llama.com/docs (guides, prompting, and API reference)

Model Weights & Cards: huggingface.co/meta-llama (weights), github.com/meta-llama/llama-models (model cards and licenses)

Platform & Infrastructure: github.com/meta-llama/llama-stack (Llama Stack framework and distributions), github.com/meta-llama/llama-stack-client-python (Python client SDK)