Eval Harness

Every Lokomotif module ships with at least one passing eval suite. The harness is the Python implementation that runs them.

Why evals are mandatory

A module without an eval is a hypothesis. Evals turn hypotheses into contracts: an authored module says “this is what the role looks like”; the eval says “this is the behavior we expect, and here’s how we’ll know if it drifts.” The Kit will not merge a module without an eval — the rule is enforced at PR review and through the schema’s source-material policy.

CLI

uv run lokomotif-eval run                    # every eval suite under modules/
uv run lokomotif-eval run --module contexts/finance/kvkk-compliance
uv run lokomotif-eval run --reporter json    # machine-readable output
uv run lokomotif-eval list                   # discovered suites
uv run lokomotif-eval scan-pii path/         # Turkey-aware PII scan

The same harness is reachable through the Node CLI as lokomotif eval run.

Eval YAML format

Each module under modules/ carries a sibling __tests__/<name>.eval.yaml:

module: roles/finance/your-role
description: 'Eval suite shape — replace with your role module.'
checks:
  - id: identity-mentions-domain
    judge: deterministic
    kind: regex
    target: /body/identity/en
    pattern: 'compliance|risk|review'
    flags: i
 
  - id: identity-not-empty-tr
    judge: deterministic
    kind: not_empty
    target: /body/identity/tr
 
  - id: voice-sounds-senior
    judge: llm
    target: /body/identity
    rubric: 'Identity should sound like a senior practitioner.'
    threshold: 0.7

Check kinds

| judge | kind | required args | pass condition |
| --- | --- | --- | --- |
| deterministic | regex | pattern, optional flags | regex matches the target text |
| deterministic | not_empty | (none) | target is non-null and non-empty |
| deterministic | array_length | min, max | array length is within min and max |
| deterministic | equals | expected | target equals the expected value |
| deterministic | contains | substring | target contains the substring |
| llm | (none) | rubric, threshold | judge score ≥ threshold |

target is a JSON Pointer (RFC 6901) into the module — /body/identity/en, /body/expertise/0, and so on. Empty pointer "" or "/" returns the document root.
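Pointer resolution can be pictured with a minimal sketch of RFC 6901 semantics. resolve_pointer below is a hypothetical helper for illustration, not the harness's API:

```python
def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer against a parsed module document."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    current = doc
    for token in pointer.split("/")[1:]:
        # Unescape per RFC 6901: ~1 -> "/" first, then ~0 -> "~"
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array index, e.g. /expertise/0
        else:
            current = current[token]       # object key
    return current

module = {"body": {"identity": {"en": "Senior compliance reviewer"},
                   "expertise": ["KVKK", "risk"]}}
resolve_pointer(module, "/body/identity/en")  # the string under identity.en
resolve_pointer(module, "/body/expertise/0")  # first expertise entry
```

A check's target is resolved this way before the judge ever sees it, so a misspelled pointer fails fast rather than silently matching nothing.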

Judges

Deterministic

Pure code. Same input always produces the same output. Use them for shape-correctness and content-presence assertions.

LLM

The harness ships with StubLLMJudge — a deterministic keyword-overlap heuristic so CI runs without API keys. It identifies itself as stub in the result so reports never confuse it with a real LLM run.
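The stub's exact scoring is not specified here, but a keyword-overlap heuristic of this kind might look like the following sketch (keyword_overlap_score is a hypothetical function, not StubLLMJudge itself):

```python
import re

def keyword_overlap_score(rubric: str, text: str) -> float:
    """Deterministic stand-in for an LLM judge: the fraction of rubric
    keywords (4+ letters) that also appear in the target text."""
    words = lambda s: set(re.findall(r"[a-z]{4,}", s.lower()))
    rubric_words = words(rubric)
    if not rubric_words:
        return 0.0
    return len(rubric_words & words(text)) / len(rubric_words)

score = keyword_overlap_score(
    "Identity should sound like a senior practitioner.",
    "A senior practitioner who reviews compliance findings.",
)
# Needs no API key, and the same input always yields the same score,
# which is what makes it safe to run in CI.
```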

Real LLM judges are implemented by the operator: a class implementing the LLMJudge protocol that calls Anthropic, an OpenAI-compatible endpoint, or a local model server. The harness is vendor-neutral; you plug in whichever judge you prefer.

from lokomotif_eval import EvalRunner, LLMJudge
 
class AnthropicJudge:
    def evaluate(self, check, target_value):
        # ... call Anthropic, score, return JudgeResult
        ...
 
runner = EvalRunner(modules_dir=modules_dir, llm_judge=AnthropicJudge())
results, summary = runner.run_paths(pairs)

Reports

Two formats:

  • Console — human-readable, ✓/✗ markers, expanded reasons for failures, aggregate counts.
  • JSON — stable shape suitable for CI metadata wrappers; includes per-check duration and severity.
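This section does not document the JSON report's field names, so the fragment below is illustrative only; apart from the per-check duration and severity mentioned above, every field name is an assumption:

```json
{
  "summary": { "total": 3, "passed": 2, "failed": 1 },
  "results": [
    {
      "check_id": "identity-mentions-domain",
      "judge": "deterministic",
      "passed": true,
      "duration_ms": 2,
      "severity": "error",
      "reason": null
    }
  ]
}
```

Whatever the real field names are, the stability guarantee is the point: CI wrappers can parse the output without pinning to console formatting.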

PII scanner

lokomotif-eval scan-pii <path> walks files and surfaces TC Kimlik candidates, Turkish IBAN, mobile numbers, and email addresses. The canonical specification is the guardrails/cross-industry/pii-tr module; the runtime patterns are aligned to it by hand for now.
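The authoritative patterns live in the pii-tr module. As background, TC Kimlik numbers carry a well-known two-digit checksum, which a scanner can use to discard arbitrary 11-digit runs. A sketch of that public checksum, not the harness's implementation:

```python
def is_valid_tckn(candidate: str) -> bool:
    """Public TC Kimlik No checksum: 11 digits, first digit non-zero,
    with the 10th and 11th digits derived from the preceding ones."""
    if len(candidate) != 11 or not candidate.isdigit() or candidate[0] == "0":
        return False
    d = [int(c) for c in candidate]
    odd_sum = d[0] + d[2] + d[4] + d[6] + d[8]  # digits 1, 3, 5, 7, 9
    even_sum = d[1] + d[3] + d[5] + d[7]        # digits 2, 4, 6, 8
    check10 = (odd_sum * 7 - even_sum) % 10
    check11 = sum(d[:10]) % 10
    return d[9] == check10 and d[10] == check11

is_valid_tckn("10000000078")  # algorithmically valid synthetic value
is_valid_tckn("12345678901")  # fails the checksum
```

Filtering candidates through a checksum like this keeps the scan's false-positive rate down without loosening the raw digit patterns.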

Coverage

The harness’s own coverage budget is 90% on lines and functions. Eval-test coverage of modules is reported separately by the harness — every module’s eval suite must include checks for the load-bearing fields of its kind.

See also