Prompt Engineering Pipeline — How It Works

How the Pipeline Works

You talk. Three skills build, diagnose, and fix your prompt. Five tools track everything behind the scenes.

Step 1 — You say "build a prompt"
Architect
Asks you targeted questions in waves to understand what you need. Silently detects the prompt type, audits for gaps, and delivers a production-ready prompt with a scorecard showing what's strong and what was traded off.
"I need a prompt that gets Claude to write brand partnership reports for my creators — they need to feel data-driven but not robotic, and each one should be customised to the brand."
Behind the scenes: Library lookup · Issue Pareto check · Intent Block created.
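The Architect's library lookup can be sketched in a few lines: find prior prompts in the same domain and use the best-scoring one as a reference point. This is a minimal sketch under assumptions — the entries and field names below are invented for illustration, not the real Prompt Library's schema.

```python
# Hypothetical library entries; "domain" and "overall" are assumed field names.
library = [
    {"domain": "brand partnerships", "overall": 3.8, "tags": ["report"]},
    {"domain": "code review", "overall": 4.2, "tags": ["engineering"]},
]

def best_reference(domain, entries):
    """Return the highest-scoring prior prompt in this domain, or None."""
    matches = [e for e in entries if e["domain"] == domain]
    return max(matches, key=lambda e: e["overall"], default=None)

ref = best_reference("brand partnerships", library)
print(ref["overall"])  # 3.8
```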
You say "evaluate"
Step 2 — Diagnosis
Evaluator
Scores the prompt across 4 dimensions (1–5 each). Finds specific, numbered issues — not vague opinions. Challenges the Architect's tradeoffs if they're costing real points. Adds historical context: "this issue type appears in 27% of evaluations."
Delivers: Scores  |  Numbered Issues (with severity + dimension)  |  Recommendations  |  Tradeoff Challenges
"Purpose Alignment: 4.0 — Structural Integrity: 2.5 — Completeness: 3.0 — Resilience: 2.0
Issue #1 [CRITICAL]: No output format specified. Issue #2 [MAJOR]: Generic persona..."
Behind the scenes: Issue frequency context · Registry write · Issue log auto-created.
What carries from Evaluate to Refine
Scores: PA 4.0, SI 2.5, C 3.0, R 2.0. The Refiner knows exactly which dimensions need surgery and which are already strong.
Numbered Issues: #1 [CRITICAL] No output format; #2 [MAJOR] Generic persona. Each is tagged with severity and the dimension it's dragging down.
Recommendations: Priority-ordered fixes the Evaluator thinks will have the biggest score impact. The Refiner uses these as its starting plan.
Tradeoff Challenges: Where the Architect made a deliberate tradeoff the Evaluator disputes. The Refiner decides whether to act on or preserve each one.
Issue History: "This type has appeared in 27% of evaluations." The Refiner knows which problems are chronic vs. one-off, and which resist fixing.
You say "refine" — all of this feeds in automatically. You don't copy anything.
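The handoff payload described above can be pictured as a small data structure. A minimal sketch, assuming a hypothetical schema — the field names are illustrative, not the pipeline's actual format:

```python
from dataclasses import dataclass

@dataclass
class Issue:
    number: int
    severity: str    # "CRITICAL" | "MAJOR" | "MINOR"
    dimension: str   # the score this issue drags down
    summary: str

@dataclass
class EvaluationReport:
    scores: dict               # e.g. {"PA": 4.0, "SI": 2.5, "C": 3.0, "R": 2.0}
    issues: list
    recommendations: list      # priority-ordered fixes
    tradeoff_challenges: list  # disputed Architect tradeoffs
    issue_history: dict        # issue type -> share of past evaluations

report = EvaluationReport(
    scores={"PA": 4.0, "SI": 2.5, "C": 3.0, "R": 2.0},
    issues=[
        Issue(1, "CRITICAL", "SI", "No output format specified"),
        Issue(2, "MAJOR", "PA", "Generic persona"),
    ],
    recommendations=["Add structured output format", "Use domain-specific persona"],
    tradeoff_challenges=["Brevity vs. explicit section headers"],
    issue_history={"missing_output_format": 0.27},
)

# The dimension needing surgery first is simply the lowest score.
worst = min(report.scores, key=report.scores.get)
print(worst)  # R
```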
Step 3 — Surgery
Refiner
Reads the full Evaluation Report. Builds a plan before touching anything. Each change maps to a numbered issue. Predicts score improvements per dimension. Checks calibration history to correct for its own bias. Runs regression checks — won't break what works to fix what's broken. Signals when gains are exhausted.
Delivers: Refined Prompt  |  Change Log (issue-mapped)  |  Score Predictions  |  Annotated Diff
"Change #1 → Resolves Issue #1 [CRITICAL]: Added structured output format with 6 sections...
Change #2 → Resolves Issue #2 [MAJOR]: Replaced generic persona with domain-specific role...
Predicted: Overall 2.9 → 4.1 | Diminishing returns: GREEN"
Behind the scenes: Calibration bias check · Issue history lookup · Resolution data filled · Diff generated.
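The calibration bias check amounts to a simple correction: average the Refiner's past (predicted − actual) errors for a dimension, then shift the new prediction by that bias. A sketch under assumptions — the history pairs below are invented for illustration:

```python
from statistics import mean

# Hypothetical history of (predicted, actual) scores for the "R" dimension.
history = {
    "R": [(3.5, 3.0), (4.0, 3.7), (3.8, 3.4)],
}

def calibrated(dimension, raw_prediction, history):
    """Shift a raw prediction by the mean past error for this dimension."""
    pairs = history.get(dimension, [])
    if not pairs:
        return raw_prediction  # no data: trust the raw prediction
    bias = mean(p - a for p, a in pairs)  # positive = overpredicting
    return round(raw_prediction - bias, 1)

print(calibrated("R", 3.5, history))  # 3.1 (a +0.4 bias, corrected down)
```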
Re-evaluate? Loop back to Step 2.
Happy with the result? Accept it.
Step 4 — Done
Accept & Archive
Accept the prompt. It's archived in the Prompt Library with scores, intent, refinement history, and tags — so next time the Architect builds something in this domain, it has a reference point.
Behind the scenes: Library stored · Session completed · Calibration filled.

5 Tools Working Behind the Scenes

Registry

Saves everything — every score, every issue, every version of your prompt. The single source of truth.

Phase 2 · 78 tests
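A single source of truth can be as simple as an append-only event log. The sketch below is an assumption for illustration — one JSON line per event, replayable into history — not the real Registry's storage format or API:

```python
import json
import os
import tempfile

class Registry:
    """Toy append-only registry: every score, issue, and version is one JSON line."""

    def __init__(self, path):
        self.path = path

    def record(self, kind, **payload):
        with open(self.path, "a") as f:
            f.write(json.dumps({"kind": kind, **payload}) + "\n")

    def history(self, kind=None):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            events = [json.loads(line) for line in f]
        return [e for e in events if kind is None or e["kind"] == kind]

reg = Registry(os.path.join(tempfile.mkdtemp(), "registry.jsonl"))
reg.record("score", dimension="SI", value=2.5)
reg.record("issue", number=1, severity="CRITICAL")
print(len(reg.history("score")), len(reg.history()))  # 1 2
```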

Diff Engine

Shows exactly what changed between prompt versions, word by word, section by section.

Phase 1 · 57 tests
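A word-level diff is achievable in pure stdlib, in the spirit of the Diff Engine. The `[-removed-]` / `{+added+}` markers below are a common convention, not necessarily the real tool's output format:

```python
import difflib

def word_diff(old, new):
    """Render a word-by-word diff with [-removed-] and {+added+} markers."""
    a, b = old.split(), new.split()
    out = []
    for op, a1, a2, b1, b2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "equal":
            out.extend(a[a1:a2])
        else:
            if a2 > a1:
                out.append("[-" + " ".join(a[a1:a2]) + "-]")
            if b2 > b1:
                out.append("{+" + " ".join(b[b1:b2]) + "+}")
    return " ".join(out)

print(word_diff("You are a helpful assistant",
                "You are a senior brand analyst"))
# You are a [-helpful assistant-] {+senior brand analyst+}
```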

Score Tracker

Tracks how accurately the Refiner predicts scores. Over time, predictions get sharper.

Phase 3 · 58 tests (shared)
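One plausible accuracy measure for such a tracker is the mean absolute error between predicted and actual overall scores. A minimal sketch — the (predicted, actual) pairs are invented for illustration:

```python
from statistics import mean

# Hypothetical (predicted, actual) overall scores from past refinements.
predictions = [(4.1, 3.8), (3.9, 3.7), (4.2, 4.1)]

mae = mean(abs(p - a) for p, a in predictions)
print(round(mae, 1))  # 0.2
```

A shrinking MAE over sessions is what "predictions get sharper" would look like in the data.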

Prompt Library

Archives your best prompts. The Architect checks it before building — learns from what worked.

Phase 3 · 58 tests (shared)

Issue Tracker

Finds your blind spots. Shows which problems keep appearing and which resist fixing.

Phase 4 · 45 tests
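The recurrence stats behind "appears in 27% of evaluations" reduce to counting how many evaluations contain each issue type. A sketch with an invented evaluation log:

```python
from collections import Counter

# Hypothetical log: each inner list is the issue types found in one evaluation.
evaluations = [
    ["missing_output_format", "vague_instruction"],
    ["missing_output_format"],
    ["generic_persona"],
    ["missing_output_format", "generic_persona"],
]

# Count each issue type at most once per evaluation, then normalize.
counts = Counter(issue for ev in evaluations for issue in set(ev))
frequency = {issue: n / len(evaluations) for issue, n in counts.items()}
print(frequency["missing_output_format"])  # 0.75
```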

238 Tests Total

Zero external dependencies. Pure Python. Every tool tested in isolation and in combination.

v0.4.0 · stdlib only

Example: Building a Creator Partnership Report Prompt

1. You: "I want to build a prompt that writes brand partnership reports for my creators." The Architect checks the Library and finds a prior prompt in "brand partnerships" that scored 3.8. It checks the Issue Tracker: the top issues in this domain are "vague_instruction" and "missing_output_format." It addresses both proactively during questioning.
2. After two or three waves of questions, the Architect delivers a production-ready prompt plus an Intent Block. The scorecard shows all blocking checks passing, with two advisory tradeoffs disclosed.
3. You: "Evaluate." The Evaluator scores PA 4.0, SI 2.5, C 3.0, R 2.0 (Overall 2.9) and finds 4 issues. It notes: "missing_output_format has appeared in 8 of the last 20 evaluations — this is a recurring pattern."
The Evaluation Report is now your handoff document

It contains the 4 dimension scores, 4 numbered issues (each with severity, dimension, and a recommendation), tradeoff challenges, and frequency context. When you say "refine," the Refiner reads all of this automatically; you don't need to copy or restate anything.

4. You: "Refine." The Refiner reads the Evaluation Report and builds a plan: Change #1 resolves Issue #1 [CRITICAL], Change #2 resolves Issue #2 [MAJOR]. It predicts Overall 2.9 → 4.1, then checks calibration: the Refiner has been overpredicting Resilience by 0.4, so it adjusts down.
5. Optional: "Evaluate again." The Evaluator re-scores and confirms 3 of 4 issues resolved. The one that persisted is marked in the Issue Tracker with persistence data, and the result feeds back into calibration.
6. You: "Accept." The prompt is archived in the Library with scores, intent, and refinement history all stored. Next time you build in this domain, the pipeline starts smarter.