# Agent Evaluation Framework

The Agent Evaluation Framework allows you to evaluate agent models with tool-augmented reasoning using “Task Bundles”: self-contained directories that include all the components needed for testing and evaluation.
## Task Bundle Structure

A task bundle is a directory with the following layout:

```
my_task/
├─ reward.py   # Reward function with @reward_function decorator
├─ tools.py    # Tool registry for this specific task
├─ seed.sql    # Initial DB state (optional)
└─ task.jsonl  # Dataset rows with task specifications
```
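For orientation, here is a minimal sketch of what `reward.py` might contain. The `@reward_function` decorator is the one named in the layout above, but the function signature and return shape used here are assumptions for illustration, not the library's documented contract:

```python
# reward.py - illustrative sketch only.
# ASSUMPTION: the decorated function receives the conversation as a list
# of message dicts and returns a score payload; the real reward-kit
# signature and return type may differ.
from typing import Any, Dict, List

from reward_kit import reward_function


@reward_function
def evaluate(messages: List[Dict[str, Any]], **kwargs) -> Dict[str, Any]:
    """Score a run on whether the agent reached a booking confirmation."""
    last = messages[-1].get("content", "") if messages else ""
    booked = "confirmation" in str(last).lower()
    return {
        "score": 1.0 if booked else 0.0,
        "reason": "booking confirmed" if booked else "no confirmation found",
    }
```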
## CLI Usage

The agent evaluation framework is integrated with the Reward Kit CLI through the `agent-eval` command.
### Basic Usage

```bash
# Run agent evaluation on a task bundle
reward-kit agent-eval --task-dir ./flight_task

# You can also specify just the task.jsonl file
reward-kit agent-eval --dataset ./flight_task/task.jsonl
```
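Each line of `task.jsonl` is a JSON object describing one task. The row below is purely illustrative: the field names are hypothetical stand-ins, not the framework's documented schema.

```json
{"id": "flight.booking.001", "initial_messages": [{"role": "user", "content": "Book me a flight from SFO to JFK."}], "evaluator": "reward.evaluate"}
```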
### Environment Variables

Models can be specified using environment variables:

```bash
# Set the model for agent evaluation
export MODEL_AGENT=openai/gpt-4o

# Set the model for the simulated user (optional)
export MODEL_SIM=openai/gpt-3.5-turbo

# Then run the evaluation
reward-kit agent-eval --task-dir ./flight_task
```
### Advanced Options

```bash
# Specify the model directly (overrides the environment variable)
reward-kit agent-eval --task-dir ./flight_task --model openai/gpt-4o

# Use a custom output directory
reward-kit agent-eval --task-dir ./flight_task --output-dir ./my_runs

# Disable the simulated user (use static initial messages only)
reward-kit agent-eval --task-dir ./flight_task --no-sim-user

# Use test mode without requiring API keys
reward-kit agent-eval --task-dir ./flight_task --test-mode

# Use a mock response in test mode
reward-kit agent-eval --task-dir ./flight_task --test-mode --mock-response

# Run in debug mode with verbose output
reward-kit agent-eval --task-dir ./flight_task --debug

# Limit the number of tasks to evaluate
reward-kit agent-eval --task-dir ./flight_task --max-tasks 2

# Run specific tasks by ID
reward-kit agent-eval --task-dir ./flight_task --task-ids flight.booking.001,flight.booking.002

# Use a specific registry for a task
reward-kit agent-eval --task-dir ./flight_task --registry-override my_custom_tools.flight_tools

# Use multiple tool registries (see the registry sketch after this block)
reward-kit agent-eval --task-dir ./complex_task --registries flight=flight_tools,hotel=hotel_tools

# Specify the evaluator
reward-kit agent-eval --task-dir ./flight_task --evaluator flight_reward.success_evaluator
```
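The `--registry-override` and `--registries` values appear to be dotted `module.attribute` paths (e.g. `my_custom_tools.flight_tools`) that resolve to registry objects. Conceptually, a registry maps tool names to the callables an agent may invoke; the plain-Python stand-in below illustrates that shape only, since reward-kit defines its own registry API:

```python
# tools.py - plain-Python stand-in for a tool registry (illustration only;
# reward-kit's actual registry API may look different).
from typing import Any, Callable, Dict

# A registry, conceptually: tool name -> callable.
flight_tools: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that records a function under a tool name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        flight_tools[name] = fn
        return fn
    return decorator


@register("search_flights")
def search_flights(origin: str, destination: str, date: str) -> list:
    """Stubbed flight search."""
    return [{"flight": "FW123", "origin": origin, "destination": destination, "date": date}]


@register("book_flight")
def book_flight(flight_id: str, passenger: str) -> dict:
    """Stubbed booking that returns a confirmation."""
    return {"confirmation": f"CONF-{flight_id}"}
```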
Testing & Debugging
The CLI provides several options for testing and debugging:
# Test mode verifies tool setup without making API calls
reward-kit agent-eval --task-dir ./flight_task --test-mode
# Debug mode shows detailed information about tool execution
reward-kit agent-eval --task-dir ./flight_task --debug
# Export tools as OpenAPI spec for manual testing
reward-kit agent-eval --task-dir ./flight_task --export-tools ./tools_spec
# Validate task bundle structure and requirements
reward-kit agent-eval --task-dir ./flight_task --validate-only
## Examples

### Basic Flight Task Evaluation

```bash
export MODEL_AGENT=openai/gpt-4o
reward-kit agent-eval --task-dir ./examples/flight_task
```

### Testing Without API Keys

```bash
reward-kit agent-eval --task-dir ./examples/flight_task --test-mode --mock-response
```
### Using Multiple Tool Registries

```bash
reward-kit agent-eval --task-dir ./examples/travel_task --registries flight=flight_tools,hotel=hotel_tools
```
### Running with Specific Task IDs

```bash
reward-kit agent-eval --task-dir ./examples/flight_task --task-ids flight.booking.001,flight.booking.002
```

### Using Debug Mode

```bash
reward-kit agent-eval --task-dir ./examples/flight_task --debug
```