Prompt Engineering Pipeline — How It Works

How the Pipeline Works

You talk. Three skills build, diagnose, and fix your prompt. Five tools track everything behind the scenes.

Step 1 — You say "build a prompt"
Architect
Asks you targeted questions in waves to understand what you need. Silently detects the prompt type, audits for gaps, and delivers a production-ready prompt with a scorecard showing what's strong and what was traded off.
"I need a prompt that gets Claude to write brand partnership reports for my creators — they need to feel data-driven but not robotic, and each one should be customised to the brand."
Behind the scenes: Library lookup · Issue Pareto check · Intent Block created.
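The Architect's library lookup can be sketched in a few lines: find prior prompts in the same domain and use the best-scoring one as a reference point. This is a minimal sketch under assumptions — the entries and field names below are invented for illustration, not the real Prompt Library's schema.

```python
# Hypothetical library entries; "domain" and "overall" are assumed field names.
library = [
    {"domain": "brand partnerships", "overall": 3.8, "tags": ["report"]},
    {"domain": "code review", "overall": 4.2, "tags": ["engineering"]},
]

def best_reference(domain, entries):
    """Return the highest-scoring prior prompt in this domain, or None."""
    matches = [e for e in entries if e["domain"] == domain]
    return max(matches, key=lambda e: e["overall"], default=None)

ref = best_reference("brand partnerships", library)
print(ref["overall"])  # 3.8
```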
You say "evaluate"
Step 2 — Diagnosis
Evaluator
Scores the prompt across 4 dimensions (1–5 each). Finds specific, numbered issues — not vague opinions. Challenges the Architect's tradeoffs if they're costing real points. Adds historical context: "this issue type appears in 27% of evaluations."
Delivers: Scores  |  Numbered Issues (with severity + dimension)  |  Recommendations  |  Tradeoff Challenges
"Purpose Alignment: 4.0 — Structural Integrity: 2.5 — Completeness: 3.0 — Resilience: 2.0
Issue #1 [CRITICAL]: No output format specified. Issue #2 [MAJOR]: Generic persona..."
Behind the scenes: Issue frequency context · Registry write · Issue log auto-created.
What carries from Evaluate to Refine
Scores: PA 4.0, SI 2.5, C 3.0, R 2.0. The Refiner knows exactly which dimensions need surgery and which are already strong.
Numbered Issues: #1 [CRITICAL] No output format; #2 [MAJOR] Generic persona. Each is tagged with severity and the dimension it's dragging down.
Recommendations: Priority-ordered fixes the Evaluator thinks will have the biggest score impact. The Refiner uses these as its starting plan.
Tradeoff Challenges: Where the Architect made a deliberate tradeoff the Evaluator disputes. The Refiner decides whether to act on or preserve each one.
Issue History: "This type has appeared in 27% of evaluations." The Refiner knows which problems are chronic vs. one-off, and which resist fixing.
You say "refine" — all of this feeds in automatically. You don't copy anything.
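The handoff payload described above can be pictured as a small data structure. A minimal sketch, assuming a hypothetical schema — the field names are illustrative, not the pipeline's actual format:

```python
from dataclasses import dataclass

@dataclass
class Issue:
    number: int
    severity: str    # "CRITICAL" | "MAJOR" | "MINOR"
    dimension: str   # the score this issue drags down
    summary: str

@dataclass
class EvaluationReport:
    scores: dict               # e.g. {"PA": 4.0, "SI": 2.5, "C": 3.0, "R": 2.0}
    issues: list
    recommendations: list      # priority-ordered fixes
    tradeoff_challenges: list  # disputed Architect tradeoffs
    issue_history: dict        # issue type -> share of past evaluations

report = EvaluationReport(
    scores={"PA": 4.0, "SI": 2.5, "C": 3.0, "R": 2.0},
    issues=[
        Issue(1, "CRITICAL", "SI", "No output format specified"),
        Issue(2, "MAJOR", "PA", "Generic persona"),
    ],
    recommendations=["Add structured output format", "Use domain-specific persona"],
    tradeoff_challenges=["Brevity vs. explicit section headers"],
    issue_history={"missing_output_format": 0.27},
)

# The dimension needing surgery first is simply the lowest score.
worst = min(report.scores, key=report.scores.get)
print(worst)  # R
```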
Step 3 — Surgery
Refiner
Reads the full Evaluation Report. Builds a plan before touching anything. Each change maps to a numbered issue. Predicts score improvements per dimension. Checks calibration history to correct for its own bias. Runs regression checks — won't break what works to fix what's broken. Signals when gains are exhausted.
Delivers: Refined Prompt  |  Change Log (issue-mapped)  |  Score Predictions  |  Annotated Diff
"Change #1 → Resolves Issue #1 [CRITICAL]: Added structured output format with 6 sections...
Change #2 → Resolves Issue #2 [MAJOR]: Replaced generic persona with domain-specific role...
Predicted: Overall 2.9 → 4.1 | Diminishing returns: GREEN"
Behind the scenes: Calibration bias check · Issue history lookup · Resolution data filled · Diff generated.
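The calibration bias check amounts to a simple correction: average the Refiner's past (predicted − actual) errors for a dimension, then shift the new prediction by that bias. A sketch under assumptions — the history pairs below are invented for illustration:

```python
from statistics import mean

# Hypothetical history of (predicted, actual) scores for the "R" dimension.
history = {
    "R": [(3.5, 3.0), (4.0, 3.7), (3.8, 3.4)],
}

def calibrated(dimension, raw_prediction, history):
    """Shift a raw prediction by the mean past error for this dimension."""
    pairs = history.get(dimension, [])
    if not pairs:
        return raw_prediction  # no data: trust the raw prediction
    bias = mean(p - a for p, a in pairs)  # positive = overpredicting
    return round(raw_prediction - bias, 1)

print(calibrated("R", 3.5, history))  # 3.1 (a +0.4 bias, corrected down)
```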
Re-evaluate? Loop back to Step 2.
Happy with the result? Accept it.
Step 4 — Done
Accept & Archive
Accept the prompt. It's archived in the Prompt Library with scores, intent, refinement history, and tags — so next time the Architect builds something in this domain, it has a reference point.
Behind the scenes: Library stored · Session completed · Calibration filled.

5 Tools Working Behind the Scenes

Registry

Saves everything — every score, every issue, every version of your prompt. The single source of truth.

Phase 2 · 78 tests
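A single source of truth can be as simple as an append-only event log. The sketch below is an assumption for illustration — one JSON line per event, replayable into history — not the real Registry's storage format or API:

```python
import json
import os
import tempfile

class Registry:
    """Toy append-only registry: every score, issue, and version is one JSON line."""

    def __init__(self, path):
        self.path = path

    def record(self, kind, **payload):
        with open(self.path, "a") as f:
            f.write(json.dumps({"kind": kind, **payload}) + "\n")

    def history(self, kind=None):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            events = [json.loads(line) for line in f]
        return [e for e in events if kind is None or e["kind"] == kind]

reg = Registry(os.path.join(tempfile.mkdtemp(), "registry.jsonl"))
reg.record("score", dimension="SI", value=2.5)
reg.record("issue", number=1, severity="CRITICAL")
print(len(reg.history("score")), len(reg.history()))  # 1 2
```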

Diff Engine

Shows exactly what changed between prompt versions, word by word, section by section.

Phase 1 · 57 tests
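A word-level diff is achievable in pure stdlib, in the spirit of the Diff Engine. The `[-removed-]` / `{+added+}` markers below are a common convention, not necessarily the real tool's output format:

```python
import difflib

def word_diff(old, new):
    """Render a word-by-word diff with [-removed-] and {+added+} markers."""
    a, b = old.split(), new.split()
    out = []
    for op, a1, a2, b1, b2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "equal":
            out.extend(a[a1:a2])
        else:
            if a2 > a1:
                out.append("[-" + " ".join(a[a1:a2]) + "-]")
            if b2 > b1:
                out.append("{+" + " ".join(b[b1:b2]) + "+}")
    return " ".join(out)

print(word_diff("You are a helpful assistant",
                "You are a senior brand analyst"))
# You are a [-helpful assistant-] {+senior brand analyst+}
```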

Score Tracker

Tracks how accurately the Refiner predicts scores. Over time, predictions get sharper.

Phase 3 · 58 tests (shared)
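One plausible accuracy measure for such a tracker is the mean absolute error between predicted and actual overall scores. A minimal sketch — the (predicted, actual) pairs are invented for illustration:

```python
from statistics import mean

# Hypothetical (predicted, actual) overall scores from past refinements.
predictions = [(4.1, 3.8), (3.9, 3.7), (4.2, 4.1)]

mae = mean(abs(p - a) for p, a in predictions)
print(round(mae, 1))  # 0.2
```

A shrinking MAE over sessions is what "predictions get sharper" would look like in the data.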

Prompt Library

Archives your best prompts. The Architect checks it before building — learns from what worked.

Phase 3 · 58 tests (shared)

Issue Tracker

Finds your blind spots. Shows which problems keep appearing and which resist fixing.

Phase 4 · 45 tests
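The recurrence stats behind "appears in 27% of evaluations" reduce to counting how many evaluations contain each issue type. A sketch with an invented evaluation log:

```python
from collections import Counter

# Hypothetical log: each inner list is the issue types found in one evaluation.
evaluations = [
    ["missing_output_format", "vague_instruction"],
    ["missing_output_format"],
    ["generic_persona"],
    ["missing_output_format", "generic_persona"],
]

# Count each issue type at most once per evaluation, then normalize.
counts = Counter(issue for ev in evaluations for issue in set(ev))
frequency = {issue: n / len(evaluations) for issue, n in counts.items()}
print(frequency["missing_output_format"])  # 0.75
```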

238 Tests Total

Zero external dependencies. Pure Python. Every tool tested in isolation and in combination.

v0.4.0 · stdlib only

Example: Building a Creator Partnership Report Prompt

1. You: "I want to build a prompt that writes brand partnership reports for my creators." The Architect checks the Library and finds a prior prompt in "brand partnerships" that scored 3.8. It checks the Issue Tracker: the top issues in this domain are "vague_instruction" and "missing_output_format." It addresses both proactively during questioning.
2. After two or three waves of questions, the Architect delivers a production-ready prompt plus an Intent Block. The scorecard shows all blocking checks passing, with two advisory tradeoffs disclosed.
3. You: "Evaluate." The Evaluator scores PA 4.0, SI 2.5, C 3.0, R 2.0 (Overall 2.9) and finds 4 issues. It notes: "missing_output_format has appeared in 8 of the last 20 evaluations — this is a recurring pattern."
The Evaluation Report is now your handoff document

It contains the 4 dimension scores, 4 numbered issues (each with severity, dimension, and a recommendation), tradeoff challenges, and frequency context. When you say "refine," the Refiner reads all of this automatically; you don't need to copy or restate anything.

4. You: "Refine." The Refiner reads the Evaluation Report and builds a plan: Change #1 resolves Issue #1 [CRITICAL], Change #2 resolves Issue #2 [MAJOR]. It predicts Overall 2.9 → 4.1, then checks calibration: the Refiner has been overpredicting Resilience by 0.4, so it adjusts down.
5. Optional: "Evaluate again." The Evaluator re-scores and confirms 3 of 4 issues resolved. The one that persisted is marked in the Issue Tracker with persistence data, and the result feeds back into calibration.
6. You: "Accept." The prompt is archived in the Library with scores, intent, and refinement history all stored. Next time you build in this domain, the pipeline starts smarter.