// docs · v0.4

Train agents in environments that don't exist yet.

AgentGYM is the platform where AI teams describe the environment their agent needs and get back a live, scored, fully instrumented testing ground in under 10 seconds. These docs walk you from install to your first scored episode in five minutes.

Quickstart

Get from zero to a scored episode in five minutes.

1. Install the CLI

# curl installer
curl -sSL get.agentgym.dev | sh

# or via Homebrew
brew install agentgym/tap/agentgym

# verify
agentgym --version
# agentgym 0.4.1 (build 2026.04.28)

2. Generate your first environment

agentgym new

# ▸ describe your environment:
# CRM for Salesforce agents

▸ parsing brief...
✓ domain identified: CRM
▸ generating schema...
✓ Contact  (12 fields · 50 seed rows)
✓ Company  (8 fields · 20 seed rows)
✓ Deal     (15 fields · 30 seed rows)
✓ Activity (6 fields · 100 seed rows)
▸ generating tasks...
✓ 112 tasks across 4 tiers
▸ spinning up container...
✓ env_crm_v1_a3f8c2 is live.
→ env.agentgym.dev/e/env_crm_v1_a3f8c2

3. Run an agent against it

agentgym run env_crm_v1_a3f8c2 --agent ./my_agent.py --tier 2

▸ loading environment... ok
▸ connecting agent...    ok
▸ running tier-2 tasks...
  step 1  search_contacts  → 3 results
  step 2  get_contact      id=c_042 → ok
  step 3  create_deal      → deal_id=d_019
  step 4  done             ✓ 0.94s

score: 1.00 · PASS · 4 steps

That's it. You just generated a deterministic environment, ran an agent, and scored it.

CLI reference

All commands accept --help for full flag docs.

Command                 Description
agentgym new            Generate a new environment from a brief
agentgym ls             List your environments
agentgym run <env>      Run an agent against an environment
agentgym score <run>    Print the scorecard for a run
agentgym replay <run>   Replay a past episode step by step
agentgym deploy         Publish an environment to the registry

agentgym new

agentgym new [flags]

Flags:
  --prompt    string   environment brief (skips interactive prompt)
  --domain    string   hint: crm | healthcare | itsm | browser | terminal | erp | api
  --tasks     int      number of tasks to generate (default: auto)
  --complexity string  low | medium | high | all (default: all)

agentgym run

agentgym run <env> [flags]

Flags:
  --agent     string   path to agent script or docker image
  --tier      int      task tier 1-4 (default: all)
  --task      string   run a single task by id
  --parallel  int      concurrent episodes (default: 1)
  --threshold float    fail if score < threshold (default: 0.85)

agentgym score

agentgym score <run>

# Example output:
score: 0.85 · MISSING_CONFIRMATION

✓ state_correct     · 1.00
✓ steps_efficient   · 0.90
✗ confirmation_sent · 0.00

→ agent skipped required notification step

Concepts

Primitive     Definition
environment   Versioned, container-backed simulation. Schema + seed + tasks + validators.
episode       One run of one task by one agent. Has a deterministic seed and replay log.
task          Scoped objective with one correct final state. Tiered 1–4 by complexity.
validator     Pure function over post-episode DB state. Returns score in [0,1].
tier          T1 lookup · T2 single-write · T3 multi-field · T4 workflow.
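
A validator in sketch form, assuming the post-episode DB state arrives as a dict of tables. The table, field, and id names below are illustrative, not the generated schema:

def deal_created_correctly(db: dict) -> float:
    # Illustrative check: exactly one new deal on company co_007,
    # left in the "Prospecting" stage. Partial credit for the wrong stage.
    deals = [d for d in db.get("Deal", []) if d.get("company_id") == "co_007"]
    created = [d for d in deals if d.get("created_this_episode")]
    if len(created) != 1:
        return 0.0
    return 1.0 if created[0].get("stage") == "Prospecting" else 0.5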

Writing a good brief

Three rules for a brief that generates a great environment:

  1. Name the domain — CRM, healthcare scheduling, IT helpdesk, etc.
  2. Name the entities — contacts, companies, deals; patients, providers, appointments.
  3. Name the constraints — “never expose PHI to the wrong patient”, “respect SLA tiers”.
Avoid: “an agent gym for sales” (too vague). A brief like “CRM for enterprise sales agents: contacts, companies, deals, and activities; agents may never edit closed deals” names all three, and the more context you give, the better the generated tasks and validators.

Scoring & failure taxonomy

Score = weighted sum of validators, in [0,1]. A run with score ≥ 0.85 is considered a pass by default (configurable with --threshold).
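
The arithmetic, as a sketch with illustrative weights and validator scores (each environment defines its own):

# Illustrative weights and validator scores only.
weights    = {"state_correct": 0.60, "steps_efficient": 0.25, "confirmation_sent": 0.15}
validators = {"state_correct": 1.00, "steps_efficient": 0.90, "confirmation_sent": 0.00}

score = sum(weights[k] * validators[k] for k in weights)  # 0.600 + 0.225 + 0.000 = 0.825
passed = score >= 0.85                                    # False at the default threshold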

Every failure gets a label from a fixed taxonomy so you can track regressions across runs:

Label                  Meaning
WRONG_ACTION           Agent took action that did not match task intent
MISSING_STEP           Final state correct, required intermediate step skipped
HALLUCINATION          Agent referenced record/field/value that doesn't exist
SAFETY_VIOLATION       Agent modified records outside scope granted by task
TIMEOUT                Agent ran to max_steps without emitting done
LOOP                   Agent emitted same action 3+ times in a row
MISSING_CONFIRMATION   Agent skipped required notification/confirmation step
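
Because the labels are fixed, you can tally them across a sweep. A sketch using the Python SDK from Connecting an agent below, where my_agent is your own agent callable:

from collections import Counter
from agentgym import Gym

gym = Gym("env_crm_v1_a3f8c2", api_key="ag_...")
failures = Counter()

# Run every tier-2 task and count failure labels across the sweep.
for episode in gym.episodes(tier=2):
    obs = episode.reset()
    while not episode.done:
        obs = episode.act(my_agent(obs))
    if episode.failure:
        failures[episode.failure] += 1

print(failures.most_common())  # e.g. [('MISSING_CONFIRMATION', 3), ('LOOP', 1)]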

Connecting an agent

AgentGYM exposes a simple REST API. Use the official Python SDK or the LangChain integration, or call the API directly.

Python SDK

from agentgym import Gym

gym = Gym("env_crm_v1_a3f8c2", api_key="ag_...")

for episode in gym.episodes(tier=2):
    obs = episode.reset()
    while not episode.done:
        action = my_agent(obs)
        obs = episode.act(action)
    print(f"score: {episode.score}  failure: {episode.failure}")

REST

# Start an episode
POST /e/env_crm_v1_a3f8c2/episodes
→ { "episode_id": "ep_7f3a1b", "observation": {...} }

# Step
POST /episodes/ep_7f3a1b/step
{ "action": "search_contacts", "params": { "email": "..." } }
→ { "observation": {...}, "done": false }

# Done
POST /episodes/ep_7f3a1b/done
→ { "score": 0.92, "failure": null }

LangChain

from agentgym.integrations.langchain import AgentGymTools
from langchain.agents import AgentExecutor

# my_lc_agent is your existing LangChain agent; episode comes from
# gym.episodes(...) as in the Python SDK example above.
tools = AgentGymTools(env="env_crm_v1_a3f8c2").as_tools()
agent = AgentExecutor(agent=my_lc_agent, tools=tools)
agent.invoke({"input": episode.task.description})

GitHub Actions

Drop the action into any workflow. It runs your environment suite on every PR and fails if any score falls below threshold.

- uses: agentgym/run-suite@v1
  with:
    env: env_crm_v1_a3f8c2
    agent: ./my_agent.py
    tier_min: 2
    threshold: 0.85
    api_key: ${{ secrets.AGENTGYM_API_KEY }}

Custom environments

The Environment SDK is a Docker image, a task definition, and a grader. Three files. Run agentgym new --sdk to scaffold.
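
For a rough idea of the grader piece, a sketch only (the file name and signature here are assumptions; the scaffold defines the real layout). A grader is the same shape as a validator: final DB state in, score plus failure label out.

# grader.py (illustrative only; the scaffolded signature may differ)
def grade(db: dict) -> tuple[float, str | None]:
    deal = next((d for d in db.get("Deal", []) if d.get("id") == "d_019"), None)
    if deal is None:
        return 0.0, "WRONG_ACTION"
    if not deal.get("confirmation_sent"):
        return 0.85, "MISSING_CONFIRMATION"
    return 1.0, None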

API reference

Base URL: https://api.agentgym.dev · OpenAPI spec

Endpoint                          Description
POST /environments                Create environment from brief
GET /environments/:id             Get environment metadata
POST /environments/:id/episodes   Start a new episode
POST /episodes/:id/step           Send an action, get observation
POST /episodes/:id/done           End episode and get score
GET /episodes/:id/replay          Get full replay log

Questions? Email sumedh@mitwa.ai or open an issue on GitHub.