Writing Effective Acceptance Tests for AI Agent Specs

Acceptance tests are the bridge between a one-line mission and a verifiable deliverable: they define exactly how to determine whether an agent (or module) satisfies the spec. Good acceptance tests are precise, automatable when possible, and scoped so a single agent can run them end-to-end.

1. What an acceptance test must include

Each acceptance test entry in your spec should contain:

Title — short, unique identifier (e.g., “Create-user-API-201”).

Purpose — one sentence describing the behavior being validated.

Preconditions / Setup — exact state, commands, or fixtures required before the test runs (files, env vars, DB records, mock responses).

Input — concrete request or action (HTTP request with path/headers/body, CLI command, file content). Paste a runnable example.

Expected output / assertions — deterministic checks (status codes, exact JSON fields and types, file existence, hashes, messages). Use precise comparisons, not vague language.

Pass/fail criteria — binary rule(s) that decide acceptance, e.g., “HTTP 201 and response.id matches DB row with email X”.

Automation hint — suggested test command or small script that an orchestrator or CI can run (e.g., “curl … && jq …” or “pytest tests/test_user_api.py::test_create_user”).
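Taken together, these fields are enough for a single agent to run the test unattended. Below is a minimal sketch of one complete entry expressed as a shell script, assuming a hypothetical users service at http://localhost:8080 (the same shape as the API template in section 4); the endpoint, JSON fields, and expected status are illustrative, not prescriptive.

    #!/usr/bin/env bash
    # Title: Create-user-API-201
    # Purpose: POST /users with a valid body returns 201 and a string id.
    # Preconditions: service running at http://localhost:8080 (hypothetical).
    set -euo pipefail

    EMAIL="test+$(uuidgen)@example.test"   # namespaced input, safe to re-run

    # Input: concrete runnable request; -w appends the HTTP status on its own line
    RESPONSE=$(curl -s -w '\n%{http_code}' -X POST http://localhost:8080/users \
      -H 'Content-Type: application/json' \
      -d "{\"email\":\"$EMAIL\",\"name\":\"Test User\"}")
    STATUS=$(echo "$RESPONSE" | tail -n 1)   # last line: status code
    BODY=$(echo "$RESPONSE" | sed '$d')      # everything else: response body

    # Pass/fail: binary — under set -e, any failed check exits non-zero
    [ "$STATUS" = "201" ]
    echo "$BODY" | jq -e '.id | type == "string"' > /dev/null
    echo "PASS: Create-user-API-201"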

2. Test types to include in an agent spec

Cover a mix of fast, deterministic checks and a few higher-level flows:

  • Unit-style checks: single-function outputs, pure logic (fast, isolated).
  • Integration checks: API contracts, schema validation, file-format outputs.
  • End-to-end flows: the minimal happy-path the agent must complete (one agent should be able to run it).
  • Edge / failure cases: predictable errors the agent must detect and handle (rate limit response, missing field).
  • Deterministic snapshot tests: file content hashes, normalized JSON diffs to prevent flaky assertions (see the normalization sketch after this list).
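For the snapshot case, the key move is normalizing volatile fields before comparing. A minimal sketch with jq, assuming hypothetical .id and .created_at fields in output.json and an assumed snapshot path:

    # Sort keys and strip/blank volatile fields so the diff is deterministic
    jq -S 'del(.created_at) | .id = "<ID>"' output.json > normalized.json
    diff normalized.json tests/snapshots/expected.json   # exit 0 = pass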

3. Make tests automatable and idempotent

– Use unique, namespaced test data (timestamps or GUIDs) so repeated runs won’t collide.
– Provide commands to reset/clean state (DB cleanup SQL, temp-folder path).
– Prefer assertions that tolerate non-deterministic fields (e.g., assert schema and types, not raw timestamps) or normalize them before comparison.
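A sketch of how these three rules might sit at the top of a test script (the psql command and TEST_DB_URL variable are illustrative assumptions; the spec should point at its real cleanup commands):

    RUN_ID="test-$(uuidgen)"    # unique namespace: repeated runs won't collide
    WORKDIR=$(mktemp -d)        # throwaway temp folder
    cleanup() {
      rm -rf "$WORKDIR"
      # hypothetical reset; substitute the spec's actual cleanup SQL
      psql "$TEST_DB_URL" -c "DELETE FROM users WHERE email LIKE '${RUN_ID}%';"
    }
    trap cleanup EXIT           # cleanup runs even when an assertion fails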

4. Examples — three reusable test templates

API contract (curl + jq)

Preconditions: service running at http://localhost:8080, test DB seeded with user@example.test

Input (runnable):

curl -s -X POST http://localhost:8080/users -H 'Content-Type: application/json' -d '{"email":"test+{UUID}@example.test","name":"Test User"}'

Assertions (runnable):

RESPONSE=$(curl -s -X POST http://localhost:8080/users -H 'Content-Type: application/json' -d '{"email":"test+{UUID}@example.test","name":"Test User"}'); echo "$RESPONSE" | jq -e '.status=="created" and (.id|type=="string")'

Pass rule: jq exits 0. Automation hint: include full command in “Commands” section.

CLI task (exit codes + output)

Input: ./tool generate-report --input tests/fixtures/min.csv --out /tmp/report.pdf

Assertions: exit code 0, file /tmp/report.pdf exists and size > 10KB. Automation hint: run in CI runner and assert with shell checks.
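Those shell checks might look like the following (the 10KB threshold is the one stated above; wc -c is used because it is portable across GNU and BSD systems):

    ./tool generate-report --input tests/fixtures/min.csv --out /tmp/report.pdf || exit 1
    [ -f /tmp/report.pdf ] || exit 1                      # file exists
    [ "$(wc -c < /tmp/report.pdf)" -gt 10240 ] || exit 1  # size > 10KB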

End-to-end handoff (file + schema)

Preconditions: agent A writes artifacts to artifacts/ready/.

Input: run agent to produce artifacts: ./agents/agent-a --produce

Assertions: for each JSON file in artifacts/ready, validate against the schema artifacts/schema/record.schema.json; the test fails if any file fails validation. Automation hint: provide a one-line validator command like find artifacts/ready -name "*.json" -exec ajv validate -s artifacts/schema/record.schema.json -d {} \;
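Because find's exit status does not reflect failures of the command run via -exec ... \;, a small loop is a more reliable pass/fail gate. A sketch:

    fail=0
    for f in artifacts/ready/*.json; do
      ajv validate -s artifacts/schema/record.schema.json -d "$f" || fail=1
    done
    exit "$fail"   # non-zero if any file failed validation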

5. Organizing acceptance tests in the spec

– Group tests by module with a short description and a target coverage goal (e.g., “3 acceptance tests: happy path + 2 edge cases”).
– For each module, list the exact commands to run tests and the expected CI job name.
– Store minimal runnable examples (requests, fixture files, validator commands) in the repo path referenced by the spec so agents can execute them without extra context.
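For example, a per-module block in the spec might look like this (the module name, command, and job name are placeholders):

    Module: user-api — 3 acceptance tests (happy path + 2 edge cases)
    Run: pytest tests/test_user_api.py
    CI job: acceptance-user-api
    Fixtures: tests/fixtures/user_api/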

6. Practical tips to reduce flakiness

– Pin external dependencies or mock external services; include mock responses in fixtures.
– Limit reliance on timing / clocks; if time must be asserted, allow a window and normalize values before comparing (see the sketch after this list).
– Capture and assert logs for critical errors rather than relying on subjective descriptions.
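As one example of a windowed time assertion, assuming a hypothetical .created_at field holding epoch seconds:

    # Accept any created_at within ±60s of now rather than an exact timestamp
    NOW=$(date +%s)
    TS=$(jq -r '.created_at' response.json)   # assumed epoch-seconds field
    DIFF=$((NOW - TS))
    [ "$DIFF" -ge -60 ] && [ "$DIFF" -le 60 ] || exit 1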

7. Example acceptance-test checklist (quick)

– Title ✔️
– Purpose ✔️
– Preconditions / setup commands ✔️
– Input (runnable) ✔️
– Precise assertions ✔️
– Pass/fail rule (binary) ✔️
– Automation command / CI job name ✔️
– Cleanup command (idempotence) ✔️

Following these rules turns the spec’s “Acceptance tests: list” line into executable, reviewable checks that let agents and humans verify work reliably and in parallel.
