// docs · v0.4
Train agents in environments that don't exist yet.
AgentGYM is the platform where AI teams describe the environment their agent needs and get back a live, scored, fully instrumented testing ground in under 10 seconds. These docs walk you from install to your first scored episode in five minutes.
Quickstart
Get from zero to a scored episode in five minutes.
1. Install the CLI
# curl installer
curl -sSL get.agentgym.dev | sh
# or via Homebrew
brew install agentgym/tap/agentgym
# verify
agentgym --version
# agentgym 0.4.1 (build 2026.04.28)
2. Generate your first environment
agentgym new
# ▸ describe your environment:
# CRM for Salesforce agents
▸ parsing brief...
✓ domain identified: CRM
▸ generating schema...
✓ Contact (12 fields · 50 seed rows)
✓ Company (8 fields · 20 seed rows)
✓ Deal (15 fields · 30 seed rows)
✓ Activity (6 fields · 100 seed rows)
▸ generating tasks...
✓ 112 tasks across 4 tiers
▸ spinning up container...
✓ env_crm_v1_a3f8c2 is live.
→ env.agentgym.dev/e/env_crm_v1_a3f8c2
3. Run an agent against it
agentgym run env_crm_v1_a3f8c2 --agent ./my_agent.py --tier 2
▸ loading environment... ok
▸ connecting agent... ok
▸ running tier-2 tasks...
step 1 search_contacts → 3 results
step 2 get_contact id=c_042 → ok
step 3 create_deal → deal_id=d_019
step 4 done ✓ 0.94s
score: 1.00 · PASS · 4 steps
CLI reference
All commands accept --help for full flag docs.
| Command | Description |
|---|---|
| agentgym new | Generate a new environment from a brief |
| agentgym ls | List your environments |
| agentgym run <env> | Run an agent against an environment |
| agentgym score <run> | Print the scorecard for a run |
| agentgym replay <run> | Replay a past episode step by step |
| agentgym deploy | Publish an environment to the registry |
agentgym new
agentgym new [flags]
Flags:
--prompt string environment brief (skips interactive prompt)
--domain string hint: crm | healthcare | itsm | browser | terminal | erp | api
--tasks int number of tasks to generate (default: auto)
--complexity string low | medium | high | all (default: all)
agentgym run
agentgym run <env> [flags]
Flags:
--agent string path to agent script or docker image
--tier int task tier 1-4 (default: all)
--task string run a single task by id
--parallel int concurrent episodes (default: 1)
--threshold float fail if score < threshold (default: 0.85)
agentgym score
agentgym score <run>
# Example output:
score: 0.85 · MISSING_CONFIRMATION
✓ state_correct · 1.00
✓ steps_efficient · 0.90
✗ confirmation_sent · 0.00
→ agent skipped required notification step
Concepts
| Primitive | Definition |
|---|---|
| environment | Versioned, container-backed simulation. Schema + seed + tasks + validators. |
| episode | One run of one task by one agent. Has a deterministic seed and replay log. |
| task | Scoped objective with one correct final state. Tiered 1–4 by complexity. |
| validator | Pure function over post-episode DB state. Returns score in [0,1]. |
| tier | T1 lookup · T2 single-write · T3 multi-field · T4 workflow. |
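To make the validator primitive concrete, here is a minimal sketch of what one could look like, a pure function over post-episode DB state returning a score in [0,1]. The table/field names (`Deal`, `Activity`, `deal_id`, `type`) are illustrative assumptions, not the platform's actual schema contract:

```python
# Hypothetical validator sketch. Takes the post-episode DB state as plain
# dicts and returns a score in [0, 1]; table and field names are illustrative.
def confirmation_sent(state: dict) -> float:
    """1.0 if every deal has a confirmation activity logged against it."""
    deals = state.get("Deal", [])
    activities = state.get("Activity", [])
    if not deals:
        return 0.0
    confirmed = {a["deal_id"] for a in activities if a.get("type") == "confirmation"}
    return sum(1.0 for d in deals if d["id"] in confirmed) / len(deals)
```

Because validators are pure, they can be re-run against a replay log without re-executing the episode.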
Writing a good brief
Three rules for a brief that generates a great environment:
- Name the domain — CRM, healthcare scheduling, IT helpdesk, etc.
- Name the entities — contacts, companies, deals; patients, providers, appointments.
- Name the constraints — “never expose PHI to the wrong patient”, “respect SLA tiers”.
Scoring & failure taxonomy
Score = weighted sum of validators, in [0,1]. A run with score ≥ 0.85 is considered a pass by default (configurable with --threshold).
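The arithmetic is a normalized weighted sum. A sketch, with validator names taken from the example scorecard and weights that are purely illustrative (real weights come from the environment's task definition):

```python
# Weighted-sum scoring sketch. Validator names match the scorecard example;
# the weights here are made up for illustration.
def score_run(results: dict, weights: dict) -> float:
    total = sum(weights.values())
    return sum(weights[name] * results[name] for name in weights) / total

results = {"state_correct": 1.00, "steps_efficient": 0.90, "confirmation_sent": 0.00}
weights = {"state_correct": 0.5, "steps_efficient": 0.2, "confirmation_sent": 0.3}

score = score_run(results, weights)
passed = score >= 0.85  # the default --threshold
```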
Every failure gets a label from a fixed taxonomy so you can track regressions across runs:
| Label | Meaning |
|---|---|
| WRONG_ACTION | Agent took action that did not match task intent |
| MISSING_STEP | Final state correct, required intermediate step skipped |
| HALLUCINATION | Agent referenced record/field/value that doesn't exist |
| SAFETY_VIOLATION | Agent modified records outside scope granted by task |
| TIMEOUT | Agent ran to max_steps without emitting done |
| LOOP | Agent emitted same action 3+ times in a row |
| MISSING_CONFIRMATION | Agent skipped required notification/confirmation step |
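Some labels are mechanical to detect from the action log alone. LOOP, for example, reduces to a scan for the same action repeated 3+ times consecutively — a sketch, assuming identical actions compare equal as strings (how AgentGYM actually compares actions internally is not documented here):

```python
# Sketch of LOOP detection: true if any action repeats `repeats`+ times
# consecutively in the episode's action log.
def is_loop(actions: list, repeats: int = 3) -> bool:
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= repeats:
            return True
    return False
```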
Connecting an agent
AgentGYM exposes a simple REST API. Use the official Python SDK, LangChain integration, or call the API directly.
Python SDK
from agentgym import Gym
gym = Gym("env_crm_v1_a3f8c2", api_key="ag_...")
for episode in gym.episodes(tier=2):
    obs = episode.reset()
    while not episode.done:
        action = my_agent(obs)
        obs = episode.act(action)
    print(f"score: {episode.score} failure: {episode.failure}")
REST
# Start an episode
POST /e/env_crm_v1_a3f8c2/episodes
→ { "episode_id": "ep_7f3a1b", "observation": {...} }
# Step
POST /episodes/ep_7f3a1b/step
{ "action": "search_contacts", "params": { "email": "..." } }
→ { "observation": {...}, "done": false }
# Done
POST /episodes/ep_7f3a1b/done
→ { "score": 0.92, "failure": null }LangChain
from agentgym.integrations.langchain import AgentGymTools
from langchain.agents import AgentExecutor
tools = AgentGymTools(env="env_crm_v1_a3f8c2").as_tools()
agent = AgentExecutor(agent=my_lc_agent, tools=tools)
agent.invoke({"input": episode.task.description})
GitHub Actions
Drop the action into any workflow. It runs your environment suite on every PR and fails if any score falls below threshold.
- uses: agentgym/run-suite@v1
  with:
    env: env_crm_v1_a3f8c2
    agent: ./my_agent.py
    tier_min: 2
    threshold: 0.85
    api_key: ${{ secrets.AGENTGYM_API_KEY }}
Custom environments
The Environment SDK is a Docker image, a task definition, and a grader. Three files. Run agentgym new --sdk to scaffold.
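For a feel of the grader piece, here is a hypothetical sketch: a function from post-episode state plus task definition to named validator scores. The function name, signature, and field names are all assumptions for illustration; run the scaffold to see the real interface:

```python
# grader.py — hypothetical grader sketch for a custom environment.
# Maps post-episode DB state + task definition to named scores in [0, 1];
# the actual SDK interface may differ.
def grade(state: dict, task: dict) -> dict:
    expected = task["expected_state"]
    deals = {d["id"]: d for d in state.get("Deal", [])}
    target = deals.get(expected["deal_id"])
    correct = target is not None and target.get("stage") == expected["stage"]
    return {"state_correct": 1.0 if correct else 0.0}
```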
API reference
Base URL: https://api.agentgym.dev · OpenAPI spec
| Endpoint | Description |
|---|---|
| POST /environments | Create environment from brief |
| GET /environments/:id | Get environment metadata |
| POST /environments/:id/episodes | Start a new episode |
| POST /episodes/:id/step | Send an action, get observation |
| POST /episodes/:id/done | End episode and get score |
| GET /episodes/:id/replay | Get full replay log |