Can I swap one model for another? It is a simple question, but there has been no consistent method for answering it confidently. Divergences of one form or another are hard to trap. So we set out to provide that method, and today we launched Eval Protocol, an open-source (OSS) library and SDK for making model evaluations work like unit tests and run through CI/CD automation.
Introducing Eval Protocol (EP)
EP is an open protocol that standardizes how developers author evaluations for large language model (LLM) applications. EP provides a specification for writing evals and storing eval results that travels with developers from local model picking and prompt engineering, through production CI/CD, to automated fine-tuning and reinforcement learning for real-world use cases, from simple markdown and JSON generation to complex customer-service agents with tool calling.
EP bridges the gap between quick wins and long-term customization. Developers can start with immediate benefits like automated CI/CD checks to prevent regressions today, then scale to sophisticated multi-turn evaluations using Model Context Protocol (MCP) for agent optimization tomorrow.
Quickstart
EP can be installed with pip install eval-protocol
Here is a simple single-turn eval example that runs an instruction-following check.
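The sketch below shows the pattern in plain pytest against an OpenAI-compatible client rather than EP's own decorator API; the model name, prompt, and pass criterion are placeholders, so treat it as an illustration of "evals as unit tests" and see the EP docs for the SDK's equivalent.

```python
# A minimal "eval as a unit test" sketch: ask a model to follow a simple
# formatting instruction and assert that the output complies. The model name,
# prompt, and pass criterion are illustrative placeholders, not EP defaults.
import json

import pytest
from openai import OpenAI  # any OpenAI-compatible endpoint works here

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Return a JSON object with exactly two keys, 'title' and 'summary', "
    "describing an open-source project of your choice."
)


@pytest.mark.parametrize("model", ["gpt-4o-mini"])  # add the models you are comparing
def test_instruction_following(model: str) -> None:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    content = (response.choices[0].message.content or "").strip()

    # Tolerate a fenced code block around the JSON, then score the rollout:
    # valid JSON with exactly the requested keys counts as a pass.
    if content.startswith("```"):
        content = content.strip("`").lstrip("json").strip()
    parsed = json.loads(content)
    assert set(parsed.keys()) == {"title", "summary"}
```

Running pytest executes the eval like any other test, which is what lets the same file plug into CI later.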
The Problem: LLM agents are software, but not developed with the same rigor
Building LLM apps involves juggling prompts, models, data, environments, and the evals themselves. Today, developers face a fractured workflow: one-off scripts for model selection, disjointed tracing for production, and custom code for fine-tuning. Our opinion is that if developers want to be confident in the quality of their LLM agents, we need to bring the same software development lifecycle to them, with the familiar concepts of unit testing and CI/CD front and center. For example, you may have the following questions about your application:
Can I swap models without losing quality? (Manual swaps in API clients)
Did my deployment cause a regression? Is my new prompt better or worse? (Ad-hoc scripts and console checks)
Is my fine-tuned model solving real problems?
Three Core Workflows, One Protocol
EP is designed around your AI journey, with emphasis on getting quick value from CI/CD while paving the way for advanced customization.
Local Evals & Model Picking: Kick off in your notebook. Use the eval-protocol library to benchmark models on your use case and answer "Can I switch from Claude to Kimi?" with reproducible results (the quickstart example above shows the single-turn pattern). Supports row-wise rewards for single-turn rollouts and MCP-based multi-turn simulations with user agents.
Scale Up & CI/CD: This is where EP shines for developers right now, with no waiting on complex setups. Use the same eval suite on production logs for monitoring. Integrate into CI/CD via GitHub Actions: run evals on PRs against golden datasets and get pass/fail comments with regression examples. Prevent issues before merge, ensuring "Did I break anything?" is answered automatically (see the regression-gate sketch after this list). Start simple with single-turn evals, then layer in MCP for multi-turn as your app grows.
Customized models: As things grow, EP helps you evaluate your own model and own your AI stack. Curate bad traces into datasets, then re-evaluate with your original suite for proven improvements (see the trace-curation sketch below). This carries you from basic quality gates on top of proprietary models to evaluating your own customized models.
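As a concrete picture of the CI/CD workflow, here is a hedged sketch of a regression gate: it replays a small golden dataset through the model under test and fails the build if the pass rate drops below a threshold. The file path, model name, dataset fields, and scoring rule are placeholders, not part of the EP spec.

```python
# Sketch of a CI regression gate: replay a golden dataset through the current
# model and fail the build if the aggregate score drops below a threshold.
# File name, model, dataset fields, and scoring rule are placeholders.
import json
import statistics
from pathlib import Path

from openai import OpenAI

client = OpenAI()
GOLDEN_PATH = Path("evals/golden.jsonl")  # one {"prompt": ..., "expected_substring": ...} per line
MODEL = "gpt-4o-mini"                     # model under test; in CI, set this from an env var
PASS_THRESHOLD = 0.9                      # minimum fraction of rows that must pass


def score_row(row: dict) -> float:
    """Return 1.0 if the completion contains the expected substring, else 0.0."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": row["prompt"]}],
        temperature=0,
    )
    content = response.choices[0].message.content or ""
    return 1.0 if row["expected_substring"] in content else 0.0


def test_no_regression_on_golden_dataset() -> None:
    rows = [json.loads(line) for line in GOLDEN_PATH.read_text().splitlines() if line.strip()]
    scores = [score_row(row) for row in rows]
    pass_rate = statistics.mean(scores)
    # The assertion message surfaces in the failed GitHub Actions check on the PR.
    assert pass_rate >= PASS_THRESHOLD, f"golden-set pass rate {pass_rate:.2f} < {PASS_THRESHOLD}"
```

In CI this is just another pytest step on every PR; the same suite can later be pointed at sampled production logs for monitoring.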
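And for the customization workflow, a hedged sketch of trace curation: keep the rollouts whose eval score fell below a cutoff and write them out as a fine-tuning dataset. The trace file layout and field names here are assumptions for illustration, not EP's result schema.

```python
# Sketch of curating "bad" traces into a fine-tuning dataset: keep the rows
# whose eval score fell below a cutoff and write them out as JSONL.
# The trace layout and field names are assumptions, not EP's schema.
import json
from pathlib import Path

TRACES_PATH = Path("logs/traces.jsonl")         # one {"messages": [...], "score": float} per line
OUTPUT_PATH = Path("datasets/hard_cases.jsonl")
SCORE_CUTOFF = 0.5


def curate_hard_cases() -> int:
    """Write low-scoring traces to a fine-tuning dataset; return how many were kept."""
    kept = 0
    with OUTPUT_PATH.open("w") as out:
        for line in TRACES_PATH.read_text().splitlines():
            if not line.strip():
                continue
            trace = json.loads(line)
            if trace["score"] < SCORE_CUTOFF:
                out.write(json.dumps({"messages": trace["messages"]}) + "\n")
                kept += 1
    return kept


if __name__ == "__main__":
    print(f"curated {curate_hard_cases()} hard cases")
```

After fine-tuning on these cases, re-running the original eval suite (for example, the golden-set gate above) against the new model is what turns "it feels better" into a proven improvement.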
EP evolves with you.
Resources:
Get started by installing Eval Protocol: pip install eval-protocol