The Prompt + Guardrail Layer: What Actually Took the Time on Our DeepSeek App

This is the follow-up to our first DeepSeek build log. Same caveat: this is first-hand — we shipped and still run the app described. We’re sharing patterns that worked for us, not universal truths.

In Part 1 we said the model was cheap and the guardrails were the work. This post is the receipt for that claim. If you’re shipping anything user-facing on an LLM API — health, legal, finance, education, anything where being wrong has a cost — the prompt layer is where you spend your real time. Here’s the structure that finally stuck for us.

The five-section system prompt that survived production

We went through many shapes before settling on this. Almost every system prompt we now write follows the same skeleton:

Role — what the assistant is, in one sentence. Not a personality, a function. (“You are an information assistant that helps users figure out which type of clinic to consider for a described symptom.”)
Scope — what it must do and must not do, explicitly. The “must not” list is more important than the “must” list.
Output contract — exact structure of every reply. Sections, length, when to include a disclaimer. Treat this as an API spec, not a suggestion.
Safety rules — red-flag triggers, refusal language, escalation guidance. We put these inside the system prompt, not as a separate post-processing step (more on why below).
Examples — 2–3 short examples covering: a normal case, an out-of-scope case, a red-flag case. These do more work than any amount of instruction prose.

Order matters. We learned to put the negative rules (what not to do) before the examples, because the model tends to mimic the examples and ignore later instructions that contradict them.

The refusal rules that actually hold

The first version of our prompt said “do not diagnose.” The model still diagnosed. We rewrote it three times before it stopped. What worked:

Refuse by function, not by topic. Saying “you are not a doctor” is weaker than saying “never name a specific disease as the cause; only suggest a category of clinic to visit.”
Give it a sanctioned alternative for every refusal. A model told only “don’t do X” will twist itself trying to almost-do X. A model told “instead of X, do Y” will reliably do Y.
Restate the boundary in the output contract. Belt and braces: the safety rule lives in the safety section and the output template says “end with: ‘This is information only, not medical advice. When in doubt, see a clinician.’”

That last one — making the disclaimer part of the required output shape rather than a behavioral request — is the single highest-leverage change we made.

Red flags: flag, never block

Our most important rule, written in capitals in our own notes: the assistant never blocks a user from acting on urgency. It also never replaces a professional.

Concretely, the system prompt holds a small list of red-flag patterns. When matched, the response template changes: the first line becomes a plainly-worded urgency cue (“This may need urgent attention — please consider calling emergency services or going to an ER now”), and the normal informational reply follows underneath. We don’t refuse, we don’t hide the user’s content, we don’t paywall. We add a line and continue.

Two reasons:

Blocking is dangerous. A user describing chest pain who gets an “I can’t help with that” wall is worse off than one who gets a clear “go now” + the info they asked for.
Liability follows behavior. A tool that consistently surfaces urgency and points to professional help is in a defensible posture; one that pretends to diagnose isn’t.

Output discipline: same shape every time

We force every reply into the same skeleton:

One-line summary of what the user described, in their own words.
A short suggested direction (the actual answer — e.g. relevant clinic category and why).
A what-to-tell-the-doctor block (this is the part users tell us they love — it turns the AI from “answer machine” into “prep tool”).
The mandatory disclaimer, verbatim.

Why the rigid shape:

Users learn to scan it in seconds — predictability is a UX feature, not a constraint.
It makes regression testing trivial — we can grep for “missing disclaimer” or “summary too long” automatically.
It bounds the worst-case output. A free-form LLM will occasionally write three paragraphs of speculation; a templated one can’t.

Token discipline that actually moves the needle

We obsessed about this for a week and learned the boring truth: the savings are in the system prompt, not the user turn.

Cache the system prompt. DeepSeek (and most major APIs) charge dramatically less for repeated prompt prefixes. If your system prompt is 1,500 tokens and serves 10,000 sessions a day, prompt caching is the single biggest cost lever you have. Use it.
Trim ruthlessly, but trim examples last. Cutting instructional prose by 30% almost never hurt us. Cutting one good example often did.
Cap conversation history. We keep the last N user turns, not the whole thread. Quality stayed flat; cost stopped creeping.
Set max_tokens low and force concise output in the prompt. “Reply in under 180 words” in the prompt + a hard token cap = predictable bills.

The compounding effect is large in percentages, even when the absolute dollars are tiny. Get into the habit early; it scales.

How we actually test the prompt

You can’t unit-test creativity, but you can unit-test guardrails. We keep a small adversarial test set — about 30 inputs — and re-run them whenever we change the prompt or model version. The set includes:

Direct asks to diagnose (“Do I have X?”)
Out-of-scope drift (“Write me a poem about this”)
Red-flag wordings, including subtle ones
Multilingual / typo-heavy / very short inputs
Prompt-injection attempts (“Ignore previous instructions…”)

For each, we don’t grade the answer; we grade whether the rules held: did it refuse to name a disease? did it include the disclaimer? did it surface urgency when it should? That’s a green/red checklist. A prompt change that fails the checklist gets rolled back, full stop.

Five things we’d tell ourselves on day one

Spend 80% of your time on the system prompt and 20% on everything else. It’s the inverse of what feels right.
Output structure beats output cleverness. A boring, predictable shape wins.
Bake disclaimers into the template, not the behavior. Don’t ask the model to remember; force the shape to include it.
Adversarial tests aren’t optional. Without them, every prompt edit is a guess.
Cache aggressively; cap context. The bill is shaped by the system prompt, not the chat.

The takeaway is the same as Part 1, just sharper: renting intelligence by the token is cheap. Making it behave is the job. Spend your saved infra dollars on writing — and testing — the words at the top of every call.

Next in the series: how we wired DeepSeek into a tiny Flask backend with rate-limiting and audit logs on SQLite — no Redis, no managed DB, and why we still wouldn’t change it.