prompt-evaluator¶
Automated self-improvement: A/B tests Yeti's own prompts and files issues when a better variant is found.
| Property | Value |
|---|---|
| Type | Scheduled |
| Default hour | Midnight (schedules.promptEvaluatorHour) |
| Uses AI | Yes |
| Backend | Claude (configurable via jobAi) |
| Config key | schedules.promptEvaluatorHour |
Cost considerations
Each prompt-evaluator run makes 11 AI calls (test-input generation, variant generation,
8 A/B test runs, and a judge call). Running daily, this adds ~330 AI invocations per month
purely for prompt self-improvement -- not productive work. Consider scheduling it less
frequently (e.g. weekly) via schedules.promptEvaluatorHour if AI usage costs are a concern,
or disable it entirely if you don't need automated prompt improvement.
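The arithmetic behind that estimate can be sketched directly. The 11-call breakdown comes from this page; the helper function itself is purely illustrative:

```typescript
// Per-run AI call count, as broken down above:
// 1 test-input generation + 1 variant generation + 8 A/B runs
// (4 test cases x 2 prompts) + 1 judge call.
const CALLS_PER_RUN = 1 + 1 + 4 * 2 + 1; // = 11

// Rough monthly invocation count for a given run cadence.
function monthlyCalls(runsPerMonth: number): number {
  return CALLS_PER_RUN * runsPerMonth;
}

console.log(monthlyCalls(30)); // daily cadence: 330 calls/month
console.log(monthlyCalls(4)); // weekly cadence: 44 calls/month
```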
What it does¶
The prompt-evaluator is a self-improvement mechanism for Yeti's plan-producing prompts. On each run it reads the source code of one registered prompt function, asks AI to generate an improved variant, A/B tests the current and variant prompts against synthetic GitHub issues, has an AI judge score the outputs, and files a GitHub issue (labeled prompt-improvement) when the variant wins convincingly.
Humans review and approve before any prompt change is applied -- no automatic modifications are made.
Prompt Registry¶
The evaluator rotates through a registry of prompt functions, evaluating one per run:
| Prompt | Source File | Purpose |
|---|---|---|
| buildNewPlanPrompt | src/jobs/issue-refiner.ts | Produce an initial implementation plan from a GitHub issue |
| buildRefinementPrompt | src/jobs/issue-refiner.ts | Refine an existing plan based on human feedback |
| buildFollowUpPrompt | src/jobs/issue-refiner.ts | Answer follow-up questions while a PR is open |
| buildReviewPrompt | src/jobs/plan-reviewer.ts | Critically review an implementation plan |
| buildPrompt (issue-worker) | src/jobs/issue-worker.ts | Implement a solution based on an issue's plan |
State is persisted to ~/.yeti/prompt-eval-state.json so each run picks up the next prompt in the registry.
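The rotate-and-persist pattern can be sketched as follows. The registry names match the table above, but the state-file shape (a nextIndex field) and the function name are assumptions, not Yeti's actual schema:

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Prompts evaluated in round-robin order, one per run (from the table above).
const REGISTRY = [
  "buildNewPlanPrompt",
  "buildRefinementPrompt",
  "buildFollowUpPrompt",
  "buildReviewPrompt",
  "buildPrompt",
];

const DEFAULT_STATE_FILE = path.join(os.homedir(), ".yeti", "prompt-eval-state.json");

// Read the persisted index, select that prompt, and persist the next index
// so the following run advances through the registry.
function nextPrompt(stateFile: string = DEFAULT_STATE_FILE): string {
  let index = 0;
  try {
    index = JSON.parse(fs.readFileSync(stateFile, "utf8")).nextIndex ?? 0;
  } catch {
    // First run or unreadable state: start from the top of the registry.
  }
  const prompt = REGISTRY[index % REGISTRY.length];
  fs.mkdirSync(path.dirname(stateFile), { recursive: true });
  fs.writeFileSync(stateFile, JSON.stringify({ nextIndex: (index + 1) % REGISTRY.length }));
  return prompt;
}
```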
How it works¶
- Read prompt source -- Creates a worktree from the default branch and reads the source file containing the target prompt function
- Generate test inputs -- Asks AI to produce 4 test cases: 2 realistic GitHub issues (one well-specified, one vague) and 2 adversarial edge cases
- Generate variant -- Asks AI to analyze the current prompt for weaknesses and propose an improved version with a rationale
- A/B comparison -- Runs both the current prompt and the variant against all 4 test cases, collecting outputs
- Judge -- An AI judge scores each output pair on four criteria (1--5 scale):
- Specificity -- Does it reference concrete files, functions, or patterns?
- Actionability -- Could a developer implement from this output?
- Scope awareness -- Does it avoid over- or under-engineering?
- Uncertainty -- Does it flag ambiguity instead of guessing?
- Report -- If the variant wins at least 3 of 4 test cases, files a GitHub issue with the full evaluation report
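The scoring and win-threshold logic in the last two steps can be sketched like this. The interfaces are hypothetical shapes for the judge's verdicts; Yeti's real report structure may differ:

```typescript
// Hypothetical per-output scores from the AI judge (each on a 1-5 scale).
interface JudgeScore {
  specificity: number; // concrete files, functions, patterns
  actionability: number; // implementable as written
  scopeAwareness: number; // avoids over- or under-engineering
  uncertainty: number; // flags ambiguity instead of guessing
}

interface TestCaseResult {
  current: JudgeScore;
  variant: JudgeScore;
}

const total = (s: JudgeScore): number =>
  s.specificity + s.actionability + s.scopeAwareness + s.uncertainty;

// A report issue is filed only if the variant wins at least 3 of the 4 test cases.
function shouldFileIssue(results: TestCaseResult[]): boolean {
  const wins = results.filter((r) => total(r.variant) > total(r.current)).length;
  return wins >= 3;
}
```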
Duplicate Prevention¶
Before filing an issue, the evaluator searches for existing open issues with the same title. If one already exists, it skips filing to avoid duplicates.
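A minimal sketch of this check, using the public GitHub REST issues endpoint. The query parameters and exact-title comparison are illustrative assumptions; Yeti's actual GitHub client may work differently:

```typescript
// Pure check applied after fetching: skip filing on an exact title match.
function hasDuplicateTitle(openIssues: Array<{ title: string }>, title: string): boolean {
  return openIssues.some((issue) => issue.title === title);
}

// List open prompt-improvement issues in the repo, then check for a duplicate.
async function issueAlreadyFiled(
  owner: string,
  repo: string,
  title: string,
  token: string,
): Promise<boolean> {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/issues` +
      `?state=open&labels=prompt-improvement&per_page=100`,
    {
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
      },
    },
  );
  return hasDuplicateTitle(await res.json(), title);
}
```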
Output¶
When the variant wins, the filed issue includes:
- The rationale for the proposed change
- The full variant prompt text in a collapsible section, so reviewers can see the exact proposed change
- Per-test-case results with scores, winner, and reasoning
- Collapsible sections showing the full current and variant outputs
- The prompt-improvement label for easy filtering
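A report body with those pieces could be assembled roughly as below. The field names and layout are hypothetical; Yeti's actual issue format may differ. GitHub renders HTML details blocks in issue bodies as collapsible sections:

```typescript
// Hypothetical per-test-case result included in the filed issue.
interface CaseReport {
  name: string;
  winner: "current" | "variant";
  reasoning: string;
  currentOutput: string;
  variantOutput: string;
}

// <details> blocks render as collapsible sections in GitHub issue bodies.
const collapsible = (summary: string, body: string): string =>
  `<details><summary>${summary}</summary>\n\n${body}\n\n</details>`;

function buildIssueBody(
  rationale: string,
  variantPrompt: string,
  cases: CaseReport[],
): string {
  const caseSections = cases.map(
    (c) =>
      `### ${c.name} (winner: ${c.winner})\n${c.reasoning}\n` +
      collapsible("Current output", c.currentOutput) +
      "\n" +
      collapsible("Variant output", c.variantOutput),
  );
  return [
    `## Rationale\n${rationale}`,
    collapsible("Proposed variant prompt", variantPrompt),
    ...caseSections,
  ].join("\n\n");
}
```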
Related jobs¶
- issue-refiner -- Source of most evaluated prompts
- plan-reviewer -- Source of the review prompt
- issue-worker -- Source of the implementation prompt