Suites & Running

This page covers suite configuration, the test runner CLI, and CI/CD integration.

Suite Directory Structure

my-test-suite/
├── suite.yaml                    # Suite configuration
└── cases/
    ├── github-delete-tests.yaml  # Test cases grouped by policy
    ├── credential-tests.yaml
    └── response-redaction.yaml

Suite Configuration

A fully annotated suite.yaml:

version: "v1"
bundle_id: "my-policies"
description: "Tests for production policy set"

providers:
  openai:
    api_key: "${OPENAI_API_KEY}"
  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"

policies:
  cel_request_rules: "../cel_request_rules.yaml"
  ai_request_rules: "../ai_request_rules.yaml"
  cel_response_rules: "../cel_response_rules.yaml"
  ai_response_rules: "../ai_response_rules.yaml"

acceptance:
  min_match_rate: 1.0
  strict_policy_match: true

execution:
  timeout_ms: 30000
  retries: 2
  retry_delay_ms: 1000
  rate_limits:
    openai:
      requests_per_minute: 60
    anthropic:
      requests_per_minute: 30

engines:
  cel:
    enabled: true
  ai:
    enabled: true
    model_matrix:
      - provider: openai
        model: gpt-4o-mini
        enabled: true
      - provider: anthropic
        model: claude-sonnet-4-5-20250929
        enabled: true

Configuration Fields

Field                           Required  Description
version                         Yes       Schema version ("v1")
bundle_id                       Yes       Unique identifier for this test suite
description                     No        Human-readable description
providers                       Yes       AI provider credentials
policies                        Yes       Paths to policy files (relative to the suite directory)
acceptance.min_match_rate       No        Minimum pass rate, 0.0-1.0 (default: 1.0)
acceptance.strict_policy_match  No        Whether an unexpected policy trigger fails the test (default: true)
execution.timeout_ms            No        Per-test timeout in milliseconds (default: 30000)
execution.retries               No        Number of retries for failed AI evaluations (default: 2)
execution.retry_delay_ms        No        Delay between retries in milliseconds (default: 1000)
execution.rate_limits           No        Per-provider rate limiting
engines.cel.enabled             No        Enable CEL testing (default: true)
engines.ai.enabled              No        Enable AI testing (default: true)
engines.ai.model_matrix         No        Models to test against
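
As a sketch, a minimal suite.yaml that leans on the defaults above needs only the required fields. Whether the runner then also demands an explicit model_matrix for AI testing depends on its validation rules, so treat this as a starting point rather than a guaranteed-valid file:

version: "v1"
bundle_id: "my-policies-minimal"

providers:
  openai:
    api_key: "${OPENAI_API_KEY}"

policies:
  cel_request_rules: "../cel_request_rules.yaml"
  ai_request_rules: "../ai_request_rules.yaml"
  cel_response_rules: "../cel_response_rules.yaml"
  ai_response_rules: "../ai_response_rules.yaml"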

Running Tests

Basic Commands

# Run all tests
maybe-dont test policies --suite-dir ./suite

# CEL only (fast, no API calls)
maybe-dont test policies --suite-dir ./suite --engine cel

# Single AI model
maybe-dont test policies --suite-dir ./suite --model openai:gpt-4o-mini

# Full model matrix comparison
maybe-dont test policies --suite-dir ./suite --matrix

# Filter by tags
maybe-dont test policies --suite-dir ./suite --tags "credentials,deny"

# Validate suite without running (check for schema errors)
maybe-dont test policies --suite-dir ./suite --validate-only

# Show cached results without re-running
maybe-dont test policies --suite-dir ./suite --summary-only
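
Because --validate-only and --engine cel make no API calls, they are cheap enough to run before every commit. One way to wire that up is a plain Git pre-commit hook; this is a sketch and assumes the suite lives at ./suite in your repository:

#!/bin/sh
# .git/hooks/pre-commit (make executable with chmod +x)
# Catch schema errors first, then CEL regressions, before the commit lands.
maybe-dont test policies --suite-dir ./suite --validate-only || exit 1
maybe-dont test policies --suite-dir ./suite --engine cel || exit 1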

Incremental Execution

For large test suites or rate-limited APIs:

Flag            Description
--incremental   Skip unchanged tests, persist state
--full          Run all tests but persist state for the next incremental run
--retry-failed  Re-run failed/errored tests even if cached
--wait          Run continuously until all tests complete (respects rate limits)
--max-tests N   Limit tests per invocation (exit code 5 if more remain)

State is persisted to ~/.local/state/maybe-dont/policy-test-state.json by default.

# Run up to 10 tests per invocation (rate-limit friendly)
maybe-dont test policies --suite-dir ./suite --incremental --max-tests 10

# Keep running until everything passes
maybe-dont test policies --suite-dir ./suite --incremental --wait

# Re-run only the tests that failed last time
maybe-dont test policies --suite-dir ./suite --incremental --retry-failed

Output Formats

Format          Flag                                  Description
Text (default)  (none)                                Pass/fail/error/skip per test to stdout
JSON            --output results.json                 Structured results with per-model breakdowns
JUnit XML       --format junit --output results.xml   For CI test reporting
Quiet           --quiet                               Suppress stdout (useful with --output)

# JSON output for programmatic use
maybe-dont test policies --suite-dir ./suite --output results.json

# JUnit XML for CI integration
maybe-dont test policies --suite-dir ./suite --format junit --output results.xml

# Quiet mode with file output
maybe-dont test policies --suite-dir ./suite --quiet --output results.json
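
The JSON results file can be post-processed with standard tooling in CI. The schema is not documented on this page, so the jq paths below are hypothetical placeholders that only show the pattern; inspect the actual structure of results.json before relying on specific fields:

# Inspect the top-level structure of the results file.
jq 'keys' results.json

# Hypothetical example: read an overall pass count for a dashboard or gate
# (replace .summary.passed with whatever the real field is called).
jq '.summary.passed' results.json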

Exit Codes

Code  Meaning
0     All tests passed, thresholds met
1     Test failure (thresholds not met)
2     Schema validation error
3     Policy integrity error (referenced policy doesn’t exist)
4     Path resolution error
5     More tests remain (with --max-tests)
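
Exit code 5 is the one worth special-casing: it means the run stopped at --max-tests with work left over, not that anything failed. A sketch of a wrapper script that keeps re-invoking the runner until the suite is actually finished (similar in effect to --wait, but giving you a place to sleep or log between batches):

#!/bin/sh
# Re-run incremental batches until no tests remain.
while true; do
  maybe-dont test policies --suite-dir ./suite --incremental --max-tests 10
  code=$?
  if [ "$code" -eq 5 ]; then
    echo "More tests remain, sleeping before the next batch..."
    sleep 60
    continue
  fi
  exit "$code"   # 0 = all passed; 1-4 = real failures, surface them
done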

Model Comparison

When using --matrix, the runner outputs a comparison table showing pass/fail/match rate per model:

Model Matrix Results
─────────────────────────────────────────────────────
Model                           Pass  Fail  Match %
─────────────────────────────────────────────────────
openai:gpt-4o-mini              18    2     90.0%
openai:gpt-4o                   20    0     100.0%
anthropic:claude-sonnet-4-5     19    1     95.0%
─────────────────────────────────────────────────────

Use this to find the model that meets your accuracy bar at the best cost.

CI/CD Integration

Fast Feedback on Every Commit

Run CEL tests on every push — they’re instant and free:

# GitHub Actions example
- name: Test CEL policies
  run: maybe-dont test policies --suite-dir ./suite --engine cel

Nightly Model Accuracy

Run the full model matrix on a schedule:

- name: Test AI policies (matrix)
  run: maybe-dont test policies --suite-dir ./suite --matrix --format junit --output results.xml
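
That step usually lives in its own scheduled workflow rather than the per-commit one. A sketch of what that file might look like, assuming GitHub Actions, a nightly cron, and actions/upload-artifact for keeping the JUnit report (adapt the install step to however maybe-dont is distributed in your setup):

# .github/workflows/policy-accuracy.yml
name: Nightly policy accuracy
on:
  schedule:
    - cron: "0 3 * * *"   # every night at 03:00 UTC
  workflow_dispatch: {}

jobs:
  model-matrix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install maybe-dont here, however it is distributed in your setup.
      - name: Test AI policies (matrix)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: maybe-dont test policies --suite-dir ./suite --matrix --format junit --output results.xml
      - name: Upload JUnit report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: policy-test-results
          path: results.xml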

Rate-Limit-Friendly CI

For large test suites with rate-limited APIs:

- name: Test policies (incremental)
  run: maybe-dont test policies --suite-dir ./suite --incremental --max-tests 50
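
For --incremental to actually skip work across CI runs, the state file mentioned above (~/.local/state/maybe-dont/policy-test-state.json by default) has to survive between jobs, which ephemeral runners do not do on their own. One way to persist it, assuming GitHub Actions and actions/cache:

- name: Restore incremental test state
  uses: actions/cache@v4
  with:
    path: ~/.local/state/maybe-dont/policy-test-state.json
    # A unique key per run plus a shared restore prefix means every run saves
    # fresh state while still restoring the most recent previous state.
    key: policy-test-state-${{ github.run_id }}
    restore-keys: |
      policy-test-state-

- name: Test policies (incremental)
  run: maybe-dont test policies --suite-dir ./suite --incremental --max-tests 50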

CI Best Practices

  • Use --engine cel for fast feedback on every commit
  • Use --matrix in a nightly or weekly job for model accuracy tracking
  • Use --format junit for CI test reporting integration
  • Use --incremental --max-tests N for rate-limit-friendly CI runs
  • Set min_match_rate appropriately — 1.0 for CEL, 0.95+ for AI is a reasonable starting point