Kamil Józwik

GitHub Models

An in-depth overview of the GitHub Models feature, now available to all developers.


You're building a feature that needs AI - maybe code review summaries, automated issue triage, or intelligent search. You've experimented with ChatGPT and know roughly what you need. But now you face a dozen decisions: Which model? Which provider? How do you test different options? Where do you store prompts? How do you version them? How do you move from prototype to production without rewriting everything?

This is where GitHub Models enters the picture. It's not another AI service or model provider - it's a workspace that embeds AI development directly into GitHub, giving you tools to experiment, evaluate, and deploy AI features using the same workflows you already use for code. By the end of this article, you'll understand what GitHub Models is, why it exists, and how to decide if and when it fits your needs.

What GitHub Models Is (and Isn't)

GitHub Models is a development environment for working with large language models, integrated into GitHub's platform. Think of it as a specialized IDE for AI features - one that treats prompts as code, provides built-in testing and evaluation tools, and leverages GitHub's existing version control and collaboration infrastructure.

It provides three core capabilities:

A model marketplace with consistent access patterns. Instead of signing up for OpenAI, Anthropic, Meta, and others separately, GitHub Models gives you access to models from multiple providers through a single API. You authenticate once with your GitHub credentials and can switch between GPT-4, Llama, Claude, or DeepSeek models without changing authentication or SDK setup.

Prompt development and version control. Prompts are stored as .prompt.yml files in your repository. This means they go through pull requests, have change history, can be reviewed by teammates, and are deployed alongside your code. The platform includes an editor designed specifically for iterating on prompts - testing variations, comparing outputs, and measuring results.

Evaluation and comparison tooling. You can run the same prompt against multiple models simultaneously, compare outputs side-by-side, and use built-in evaluators (or write custom ones) to score results on dimensions like relevance, groundedness, or specific business criteria. This helps you make data-driven decisions about which model or prompt variation works best for your use case.

What GitHub Models is NOT: it's not a new model or AI service. It doesn't train models. It's not a model hosting platform where you deploy fine-tuned models. It's middleware and tooling that sits between your application and various AI providers, focused specifically on the development workflow.

Why?

Traditional AI development happens in scattered tools: experimenting in a vendor's playground, copying prompts into Notion, running evaluations in notebooks, deploying through cloud consoles, and version controlling nothing. Each step lives in a different system with different authentication and workflows.

This fragmentation creates predictable problems. Prompts drift between what's documented and what's deployed. Different team members use different models without clear rationale. Testing happens manually and inconsistently. Moving from prototype to production requires rewriting integrations. Non-technical stakeholders can't contribute to or review prompts.

GitHub Models addresses this by centralizing AI development where your code already lives. The key insight is that AI features are software features - they need version control, code review, testing, and deployment pipelines just like any other code. By treating prompts as first-class artifacts in your repository and providing AI-specific development tools within GitHub's existing workflows, it removes the context switching and scattered tooling.

This approach matters most when you're building AI features as part of a larger application, working with a team, or need to maintain and evolve AI functionality over time. It's less relevant if you're doing one-off explorations or building standalone AI applications without broader engineering workflows.

Prompts as Versioned Artifacts

The fundamental building block in GitHub Models is the prompt configuration file. These .prompt.yml or .prompt.yaml files contain everything needed to reproduce an AI interaction:

```yaml
name: Code Review Summarizer
model: openai/gpt-4o
modelParameters:
  temperature: 0.3
  max_tokens: 500
messages:
  - role: system
    content: You are a code reviewer. Summarize PR changes concisely.
  - role: user
    content: |
      Review this diff:
      {{diff}}
```

This structure separates concerns in a useful way. The messages array contains your actual prompt - the instructions and placeholders for dynamic content. The model and modelParameters specify which model to use and how to configure it. Optional fields like testData and evaluators support development and testing workflows.
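For instance, those optional fields might look like this. This is a sketch: the testData and evaluators field names follow GitHub's documented prompt.yml schema, but the values are hypothetical and the exact shape should be verified against the current docs.

```yaml
# Hypothetical continuation of the prompt file above
testData:
  - diff: "diff --git a/app.py b/app.py ..."
    expected: "A one-paragraph summary of the app.py change."
evaluators:
  - name: mentions-the-file
    string:
      contains: "app.py"
  - name: close-to-expected
    uses: github/similarity
```

Each testData row supplies values for the template's placeholders plus an expected output; each evaluator scores the model's actual output against it.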

Storing prompts this way creates several advantages. First, they have full change history - you can see who changed what and when, and revert if needed. Second, they go through code review, so prompt changes get the same scrutiny as code changes. Third, they're shareable and discoverable - anyone on the team can find, understand, and reuse prompts without hunting through Slack threads or documentation.

The placeholder syntax ({{variable}}) keeps prompts generic and reusable. Your application passes in actual values at runtime, but the template remains stable in version control. This mirrors how you'd write a SQL query or a template in other contexts - the structure is versioned, the data is supplied dynamically.
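That runtime substitution is simple enough to sketch in a few lines. This is a hypothetical helper for illustration; the official SDKs and integrations handle placeholder filling for you.

```python
import re

def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with runtime values, failing loudly on gaps."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return variables[key]
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# The versioned template stays stable; the data is supplied per request.
template = "Review this diff:\n{{diff}}"
print(render_prompt(template, {"diff": "- old line\n+ new line"}))
```

Raising on a missing variable (rather than silently leaving the placeholder in) catches mismatches between the prompt file and the calling code early.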

Rapid Experimentation

The Playground is GitHub Models' interactive environment for quick exploration. You pick a model, type a prompt, adjust parameters like temperature or token limits, and see results immediately. No API keys, no SDK setup, no authentication complexity - just experimentation.

Its value is speed and low friction. When you're starting a new AI feature, you often don't know which model will work well or how to phrase your prompt. The Playground lets you try different approaches rapidly. You can compare two models side-by-side by submitting the same prompt to both, revealing differences in response style, accuracy, or cost.

Parameters like temperature (randomness), max tokens (response length), and frequency penalty (repetition control) significantly affect model behavior, but their impact isn't always intuitive. The Playground makes parameter tuning interactive - adjust a slider, rerun the prompt, see how the output changes. This builds intuition about model behavior faster than reading documentation.

The limitation is that the Playground is for exploration, not production use. It has rate limits designed for experimentation (typically 10-20 requests per minute depending on model tier and your GitHub plan). You can't handle user traffic through the Playground. Once you understand what works, you move to the API or integrate prompts into your application.

Comparisons and Evaluations

Choosing between models often feels arbitrary without systematic comparison. GPT-5 is expensive but capable. Llama 3 is open-source and cheaper. Claude excels at certain reasoning tasks. DeepSeek offers different trade-offs. How do you decide?

The Comparisons view provides a structured approach. You create multiple prompt configurations - same prompts with different models, or different prompt variations with the same model - and run them against rows of test inputs simultaneously. Results appear in a grid, showing outputs side-by-side along with metadata like latency and token usage.

This visualization makes differences concrete. You might discover that GPT-4o and GPT-4o-mini produce nearly identical results for your use case, but mini costs 10x less. Or that a cheaper model works fine for 80% of cases, suggesting a tiered approach. Or that your prompt wording has more impact on quality than model choice, pointing you toward prompt optimization rather than upgrading models.
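That tiered idea translates directly into routing logic. The heuristic and model names below are placeholders, not a recommendation; the point is that comparison data can drive a cheap-by-default policy.

```python
def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Route simple requests to a small model, hard ones to a larger one."""
    # Hypothetical thresholds; tune these from your own Comparisons results.
    if needs_reasoning or len(prompt) > 2000:
        return "openai/gpt-4o"        # capable tier for the hard 20%
    return "openai/gpt-4o-mini"       # cheaper tier for the easy 80%
```

Because GitHub Models exposes every model through the same API, swapping the returned identifier is the only change the calling code needs.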

Evaluators formalize this comparison. Instead of reading outputs manually and making subjective judgments, you define criteria and score outputs automatically.

GitHub Models provides built-in evaluators:

  • Similarity measures how close an output is to an expected result using embedding-based comparison. Useful when you have known-good examples and want consistency.

  • Relevance scores whether the output addresses the user's query appropriately. Helps catch responses that are factually correct but miss the point.

  • Groundedness checks if outputs stay faithful to provided context rather than hallucinating information. Critical for RAG (retrieval-augmented generation) applications.

  • String checks verify outputs contain or avoid specific strings, useful for format requirements or safety guardrails.
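String checks are the easiest evaluator to reason about. A toy version, purely to show the idea rather than the platform's implementation, looks like:

```python
def string_check(output, contains=(), avoids=()):
    """Pass if the output includes every required string and none of the banned ones."""
    has_required = all(s in output for s in contains)
    lacks_banned = all(s not in output for s in avoids)
    return has_required and lacks_banned

# Guard a format requirement and a safety rule at once:
summary = "Changed files: app.py, tests/test_app.py"
print(string_check(summary, contains=["Changed files:"], avoids=["TODO"]))  # True
```

Run across a table of test inputs, even a check this crude turns "the output looks okay" into a pass/fail number you can track across prompt revisions.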

You can also write custom evaluators as prompts themselves - using one LLM to judge another's output against your specific criteria. This "LLM-as-judge" pattern works surprisingly well for nuanced business requirements that simple pattern matching can't capture.

The test data section of prompt files integrates with this evaluation system. You define input-output pairs, run evaluations, and get quantitative scores across different configurations. This transforms model selection from guesswork into a data-driven decision with clear trade-off visibility.

From Development to Deployment

GitHub Models isn't isolated - it integrates with the tools developers already use. Understanding these integration patterns helps you see where it fits in your broader workflow.

GitHub Actions can call models directly using the repository's GITHUB_TOKEN, no separate authentication needed. This enables AI features in CI/CD: automatic PR summarization, issue triage, test case generation, documentation updates. Since prompts are versioned in the repo, Actions workflows can reference them by file path, ensuring consistency between what you tested and what runs in automation.

GitHub CLI extension brings models to the command line. You can pipe command output into a model, get explanations of error messages, or generate scripts from natural language descriptions. The key advantage is staying in your terminal without context switching to a web interface or API client.

GitHub Copilot Chat extension allows you to use models beyond Copilot's default within your IDE. You can ask for model recommendations based on criteria ("suggest a cheap model good at JSON parsing") or use specific models for particular tasks. This is particularly useful when you want different capabilities than Copilot's general-purpose model provides.

REST API provides programmatic access with standard OpenAI-compatible endpoints. If you've built against OpenAI's API, switching to GitHub Models often requires only changing the base URL and authentication header. This compatibility reduces integration friction and makes model switching straightforward.
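In practice the switch amounts to pointing an OpenAI-style request at GitHub's endpoint. Here is a standard-library sketch; the inference URL and header shape are assumptions to check against the REST API reference, and the token shown is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint; verify against the GitHub Models REST API docs.
GITHUB_MODELS_URL = "https://models.github.ai/inference/chat/completions"

def build_chat_request(token: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request authed with a GitHub token."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        GITHUB_MODELS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # GitHub PAT or GITHUB_TOKEN
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "ghp_example_token", "openai/gpt-4o-mini",
    [{"role": "user", "content": "Explain this error: EADDRINUSE"}],
)
# With a real token, sending is one more line:
# response = json.load(urllib.request.urlopen(req))
```

The request body is the same JSON you would send to OpenAI directly, which is what makes model and provider switching a configuration change rather than a rewrite.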

These integrations share a common pattern: authenticate once with GitHub credentials, then access multiple models through consistent interfaces. The barrier to trying a new model or updating a prompt is low because the plumbing is already in place.

Cost Models

GitHub Models operates on two tiers: free rate-limited usage for prototyping, and paid usage for production workloads.

The free tier is available to anyone with a GitHub account. You can make 10-20 requests per minute (depending on model and your Copilot subscription level), up to 50-300 per day, with token limits per request. These quotas are designed for exploration, not production traffic. If you hit limits, you wait until the rate limit window resets.


Rate limits vary by model complexity and your GitHub subscription:

  • Low-tier models (smaller, faster, cheaper): ~15 RPM, 150 requests/day
  • High-tier models (larger, slower, more expensive): ~10 RPM, 50 requests/day
  • Latest reasoning models (o1, o3, DeepSeek-R1): 1-2 RPM, 8-20 requests/day
  • Copilot Enterprise users get 1.5-3x higher limits than free accounts
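When you do hit those limits, requests are rejected with a rate-limit error (HTTP 429 over the REST API); a simple retry-with-backoff wrapper keeps experiments resilient. This is a generic sketch, not an official SDK feature.

```python
import time

class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from the models endpoint."""

def call_with_backoff(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a model call on rate limiting, doubling the wait each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

For production traffic this is a stopgap at best; sustained 429s are the signal to move to the paid tier rather than to retry harder.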

For production use, you opt into paid usage at the organization or enterprise level. GitHub Models uses unified pricing: $0.00001 per token unit, where a token unit is calculated by multiplying actual tokens by model-specific multipliers. For example, GPT-4o uses 0.25x multiplier for input tokens and 1x for output tokens - 1 million input tokens costs $2.50, 1 million output tokens costs $10. This unified pricing simplifies budgeting compared to tracking different rates across multiple providers.

The decision between free and paid isn't just about volume. The free tier has smaller context windows (typically 8K input, 4K output vs 16K+ in paid) and lacks features like custom evaluators at scale. If your feature needs large document processing, complex multi-turn conversations, or serves user traffic, the paid tier is necessary. For developer tools that run occasionally, prototyping, or learning, the free tier often suffices.

A third option exists: bring your own keys (BYOK). If you already have OpenAI or Azure subscriptions with negotiated rates or committed spend, you can connect those directly to GitHub Models. You use GitHub's development and evaluation tools, but billing and quota enforcement happen through your existing provider account. This helps organizations standardize on GitHub's workflow tools while maintaining their preferred vendor relationships and pricing.

Organizational Control and Governance

At enterprise scale, letting every developer use any model without oversight creates problems: unexpected costs from expensive models, compliance issues from unapproved providers, inconsistent capabilities across teams.

GitHub Models provides governance controls at the enterprise and organization level. Administrators can:

  • Enable or disable GitHub Models entirely for their organization, controlling whether teams have access at all.

  • Create allowlists or blocklists for models and publishers. You might approve only certain OpenAI models, or block specific providers due to compliance requirements. These lists can be precise (blocking a single model) or broad (allowing an entire publisher).

  • Bring custom models via API keys so teams can use organization-approved models not in GitHub's default catalog. This balances flexibility with control - teams get access to specialized models, but administrators manage the keys and billing.

  • Monitor usage through logs and billing dashboards, seeing which teams use which models and how much they spend. This visibility helps with cost optimization and identifying unexpected usage patterns.

The principle is centralized governance with distributed development. Teams retain flexibility to experiment and choose appropriate models for their features, while the organization maintains control over costs, compliance, and vendor relationships.

For individual developers or small teams, most of these controls aren't relevant. But understanding them matters when evaluating GitHub Models for enterprise adoption - the platform scales from personal experimentation to organization-wide standards.

When GitHub Models Fits

GitHub Models makes sense in specific contexts and less sense in others. Here's how to think about fit.

Use GitHub Models when:

  • You're building AI features as part of a larger application that lives in GitHub. The integration with repositories, Actions, and pull requests provides the most value when your AI work happens alongside other development.

  • You need to compare models systematically. If you're evaluating which model works best for your use case, or if you expect to switch models as better options become available, the comparison and evaluation tools save significant time.

  • You're working with a team. The version control, code review, and shared workspace aspects matter more with multiple people involved. Solo developers get less benefit from collaboration features.

  • You want model-agnostic infrastructure. By abstracting behind a consistent API, GitHub Models reduces vendor lock-in and makes model switching tractable. This matters if you expect the AI landscape to keep shifting.

  • You're in the GitHub ecosystem already. If you use GitHub Actions for CI/CD, GitHub CLI for development, or GitHub Copilot for coding assistance, adding GitHub Models continues that integration.

Don't use GitHub Models when:

  • You need specialized models not in the catalog. While BYOK supports OpenAI and Azure, if your requirements demand HuggingFace models, local deployment, or fine-tuned custom models, GitHub Models isn't designed for that.

  • You're doing ML training or fine-tuning. GitHub Models is about using existing models, not training new ones. For that, you need dedicated ML platforms.

  • You have simple, one-off AI needs. If you just want to call ChatGPT occasionally from a script, GitHub Models adds unnecessary complexity. Direct API calls to a provider are simpler.

  • Your application has zero GitHub presence. If your code lives in GitLab, Bitbucket, or internal systems, the GitHub-specific features provide no value. You're better served by provider SDKs or frameworks like LangChain.

  • You need the absolute lowest latency. Going through GitHub's infrastructure adds a small proxy layer. For latency-critical applications, direct provider APIs may be marginally faster.

Prompts vs Frameworks

A common question: how does GitHub Models relate to frameworks like LangChain, LlamaIndex, or Semantic Kernel?

These tools solve different problems. LangChain and similar frameworks provide abstractions for building complex AI applications: chain of thought reasoning, document retrieval, agent loops, memory management, tool use. They help you compose multiple LLM calls into sophisticated workflows.

GitHub Models focuses on the development experience for individual AI interactions: testing models, evaluating outputs, versioning prompts, integrating with development workflows. It provides less abstraction but more development tooling.

You can use both together. A typical pattern: develop and test prompts in GitHub Models, then deploy them using LangChain for orchestration. Store your .prompt.yml files alongside your LangChain code, reference them in your chains, and leverage GitHub Models' evaluation tools to validate changes.

The choice depends on your application's complexity. If you're adding a single AI feature with straightforward prompt-response patterns, GitHub Models alone may suffice. If you're building a RAG system with document chunking, vector search, and multi-step reasoning, you'll likely need a framework - but can still use GitHub Models for individual prompt development and testing.

Real-World Patterns

✅ GitHub Models works best for specific categories of features:

  • Developer tooling. Automatic PR summaries, issue triage, test case generation, documentation updates. These run in GitHub Actions, have clear success criteria, and benefit from versioned prompts that evolve alongside your codebase.

  • Content operations. Generating release notes, summarizing discussions, classifying feedback. These often run on schedules or webhooks, process GitHub data, and need consistency over time that versioned prompts provide.

  • Command-line assistance. Using the CLI extension to explain errors, generate scripts, or query repositories in natural language. The value is staying in your terminal without context switching.

  • API prototyping. Quickly testing which model and prompt work for your use case before committing to production infrastructure. The free tier and comparison tools accelerate this exploration.

❌ Common anti-patterns also emerge:

  • Real-time chat interfaces. While possible, the rate limits and proxy overhead make GitHub Models less suitable than direct provider APIs for user-facing chat applications with high traffic.

  • Fine-tuning workflows. GitHub Models doesn't support model training or fine-tuning. If you need task-specific model customization beyond prompting, you need different tools.

  • Complex multi-agent systems. Orchestrating multiple AI agents with state management, tool calling, and branching logic requires frameworks like LangChain or AutoGPT, not GitHub Models' single-prompt focus.

  • Mobile or edge applications. GitHub Models is cloud-based and tied to GitHub authentication. For on-device AI or offline functionality, you need different deployment models.

What You Should Remember

GitHub Models is a development workspace for AI features, not an AI service itself. It provides access to multiple models through a unified interface, development tools for prompts, evaluation capabilities, and integration with GitHub workflows.

Its core value proposition is treating prompts as code: versioned, reviewed, tested, and deployed through the same processes as your other software. This matters most when building AI features as part of team-developed applications rather than standalone AI products.

The platform operates on a free tier for experimentation (with rate limits) and paid tier for production (with usage-based pricing). Organizations can also bring their own API keys to use existing vendor relationships while leveraging GitHub's development tools.

Choose GitHub Models when you need systematic model comparison, team collaboration on prompts, or integration with GitHub-based development workflows. Skip it when you need model training, have no GitHub presence, or need capabilities outside the provided model catalog.

As AI development matures, tools like GitHub Models that reduce friction between prototyping and production, enable systematic evaluation, and treat AI artifacts as first-class components of your codebase become increasingly valuable. The specific features will evolve, but the underlying approach - embedding AI development into existing software workflows - represents where the ecosystem is heading.

Where to Go From Here

If GitHub Models sounds useful for your context, start with the model playground. Pick a model, experiment with prompts, and compare responses across different models. This hands-on exploration builds intuition faster than reading documentation.

When ready to integrate with your actual development workflow, create a test repository and try storing a prompt file. Use the Comparisons view to evaluate different approaches against your real use case data. This reveals whether the tools solve your actual problems.

For production deployment, review the billing documentation to understand costs, examine rate limits for your subscription tier, and consider whether BYOK makes sense for your organization's vendor relationships.

The GitHub Models REST API reference provides full technical details when you're ready to implement. The GitHub community discussions offer a space to see how others are using the platform and ask questions.

Most importantly, GitHub Models is a tool for a specific context: developing AI features as part of broader applications using GitHub workflows. It won't fit every situation, but when it does, it removes significant friction from the AI development process.