Write the eval before you write the prompt
At Replace Works we stopped tuning prompts by vibes. Here is the small eval-first loop we run before any agent ships.
By Replace Works
Most teams write the prompt first and the eval never. They open a chat window, nudge the wording until the demo looks good, ship it, and then spend the next month reacting to weird outputs in production. We did this too. It does not scale, and it quietly erodes trust in the whole feature.
The fix that stuck for us is boring: write the eval before you write the prompt. Before touching a single instruction, we write down ten to thirty concrete input and output pairs that describe what good looks like. Some are happy paths. Most are the gnarly edge cases that actually break things: empty inputs, hostile inputs, ambiguous requests, and the long tail of real user phrasing. If we cannot describe success as a checkable example, we do not yet understand the task well enough to automate it.
Once the examples exist, the prompt becomes a search problem instead of an art project. We run the candidate prompt against the full set, score each output, and read the failures. The score is never the point. The failures are the point. They tell you exactly where the model misread the task, and they turn vague complaints like it feels off into a specific row you can stare at.
A few things we learned the hard way. First, keep the eval set in version control next to the code, not in a spreadsheet someone forgets about. When the eval lives in the repo, every prompt change shows up as a diff with a measurable effect, and you can block a merge that regresses the numbers. Second, mix automatic and human grading. Exact match and schema checks catch the cheap mistakes for free. For anything subjective, a rubric plus a second model as a judge gets you most of the way, but you still want a human to spot check the judge, because judges drift too.
Third, treat the eval set as a living asset. Every production incident becomes a new test case. The bug that woke someone up at 2am goes straight into the set so it can never come back silently. Over a few months this turns into the single most valuable artifact the team owns. New hires read it to understand the product. Model upgrades become a one command decision instead of a leap of faith.
The payoff shows up when you swap models. When a new model drops, we do not argue about whether it is better. We run the suite, look at the delta, and decide in an afternoon. The same harness that protects against prompt regressions makes provider migrations cheap, which matters more every quarter as the frontier keeps moving.
None of this requires a fancy platform. Our first version was a single script, a folder of JSON examples, and a CSV of scores. The discipline is what matters, not the tooling. Write the eval first, read the failures, fold incidents back in, and let the numbers gate the merge. Do that and prompt engineering stops feeling like guesswork and starts feeling like engineering.