Why AI Trip Planners Spit Out Useless Itineraries — and the Constraint Layer That Fixes It

Another first-hand build log — this one from a different app than our DeepSeek series. We designed and shipped the AI itinerary feature described here. Patterns that worked for us, not gospel.

Ask any LLM for “a 3-day itinerary for Rome” and you’ll get something instantly. It’ll look great. It’ll also be subtly useless: it ignores that you’re traveling with a 4-year-old and a grandparent, it schedules three museums back-to-back, and it assumes everyone walks 20,000 steps a day. We built an AI trip-planning feature and learned this the hard way: the text generation was the easy 10%. The hard 90% is the constraint layer. Here’s what that means and how we built it.

The trap: a great-sounding plan nobody can follow

Our first version did exactly what every demo does — fed the destination and dates to the model and printed the result. Users loved it for about one trip. Then the feedback rolled in: “Day 2 had us crossing the city four times.” “No way the kids last through that.” “It put a hike on the day grandma was with us.”

The plans weren’t wrong. They were unconstrained. A travel itinerary isn’t a creative writing task; it’s a scheduling problem with human limits. The LLM is brilliant at the prose and clueless about the limits — because we hadn’t told it the limits in a form it could respect.

What the constraint layer actually is

The constraint layer is the structured set of non-negotiables you compute before you ever call the model and then enforce on its output afterward. For group travel, ours covers four families of constraints:

Who’s in the group. Kids, elderly, mixed-fitness, multiple families traveling together. Each implies hard limits: nap windows, max walking distance, bathroom/rest cadence, “nothing starts before 9am.”
Pacing. A real day has a budget of energy, not just hours. We cap activities-per-day and force a rest block. The model loves to fill every slot; the constraint layer won’t let it.
Geography. Group nearby things together, minimize backtracking. This is a routing problem, and it’s the one LLMs are worst at — they’ll happily zigzag across a city.
Logistics reality. Opening hours, “this needs a booking,” travel time between stops. The stuff that turns a pretty list into a plan you can actually execute.

None of that is the model’s job. It’s our job to compute these and hand them over.
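To make the geography point concrete, here is a minimal sketch of how backtracking could be measured in code rather than left to the model. It is an illustration, not our production router: the stops are made-up points on a flat grid measured in kilometres, and the nearest-neighbor ordering is a simple heuristic we swapped in as a baseline (it is not an optimal route). The names route_length, nearest_neighbor_order and backtracking_ratio are hypothetical.

```python
import math

Point = tuple[float, float]  # hypothetical (x, y) position in km on a flat local grid


def route_length(stops: list[Point]) -> float:
    """Total straight-line distance when visiting stops in the given order."""
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def nearest_neighbor_order(stops: list[Point]) -> list[Point]:
    """Greedy reorder: start at the first stop, always go to the closest unvisited one."""
    if not stops:
        return []
    route = [stops[0]]
    remaining = list(stops[1:])
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(route[-1], p))
        remaining.remove(nxt)
        route.append(nxt)
    return route


def backtracking_ratio(stops: list[Point]) -> float:
    """How much longer the given order is than the greedy baseline (1.0 = no worse)."""
    baseline = route_length(nearest_neighbor_order(stops))
    return route_length(stops) / baseline if baseline else 1.0


# Made-up example: a day that zigzags along one street.
zigzag_day = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(route_length(zigzag_day))            # 11.0 km as ordered
print(nearest_neighbor_order(zigzag_day))  # visits 0, 1, 4, 5 in turn
print(round(backtracking_ratio(zigzag_day), 2))
```

A ratio well above 1.0 is a cheap signal that the day crosses the city more than it needs to. Real routing would use actual travel times instead of straight lines, which this sketch does not attempt.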

The pattern: constrain before, enforce after

The architecture that finally worked is a sandwich, with the LLM in the middle:

Before the call — translate group facts into hard rules. “Family with a toddler + one grandparent” becomes concrete parameters: max ~3 activities/day, walking radius capped, a protected midday rest, nothing strenuous on shared-with-elderly days. These go into the prompt as explicit constraints, not vibes.
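A minimal sketch of that translation step, under stated assumptions: the age thresholds, the walking limits, and the names Traveler, DayConstraints, derive_constraints and constraints_to_prompt are all hypothetical placeholders, not our real values. It also simplifies by applying the "nothing strenuous" rule to the whole trip rather than only to the days the grandparent joins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traveler:
    age: int


@dataclass(frozen=True)
class DayConstraints:
    max_activities: int
    max_walk_km: float
    earliest_start: str   # "HH:MM", 24-hour clock
    midday_rest: bool
    allow_strenuous: bool


def derive_constraints(group: list[Traveler]) -> DayConstraints:
    """Map group facts to hard limits. Every threshold here is a placeholder."""
    has_small_child = any(t.age <= 5 for t in group)
    has_senior = any(t.age >= 70 for t in group)
    limited = has_small_child or has_senior
    return DayConstraints(
        max_activities=3 if limited else 5,
        max_walk_km=3.0 if limited else 8.0,
        earliest_start="09:00" if has_small_child else "08:00",
        midday_rest=has_small_child,
        allow_strenuous=not has_senior,
    )


def constraints_to_prompt(c: DayConstraints) -> str:
    """Render the limits as explicit, checkable rules to paste into the prompt."""
    lines = [
        "Hard constraints (do not violate):",
        f"- At most {c.max_activities} activities per day.",
        f"- At most {c.max_walk_km:g} km of walking per day.",
        f"- Nothing starts before {c.earliest_start}.",
    ]
    if c.midday_rest:
        lines.append("- Keep a protected midday rest block.")
    if not c.allow_strenuous:
        lines.append("- No strenuous activities.")
    return "\n".join(lines)


# Toddler, two parents, one grandparent.
family = [Traveler(age=4), Traveler(age=36), Traveler(age=38), Traveler(age=72)]
rules = derive_constraints(family)
print(constraints_to_prompt(rules))
```

The useful property is that the same DayConstraints object can feed both the prompt text and the post-generation checker, so the two cannot drift apart.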

The call — let the model do what it’s good at. Given tight constraints, the LLM is excellent at the human layer: suggesting which kid-friendly stop, writing the warm one-line “why you’ll like this,” sequencing within the rules. We lean on it for judgment and language, not for math.

After the call — validate and repair. We don’t trust the output blindly. A checker re-reads the generated plan against the same constraints: too many stops? a strenuous item on a low-energy day? obvious backtracking? If it fails, we either auto-repair (drop/swap the offending item) or send it back to the model with the specific violation called out. The first rule of itinerary-building: never ship an LLM plan you haven’t validated against your own rules.
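The validate-and-repair step can be sketched roughly as follows. This is an illustration under assumptions, not our actual checker: Activity, find_violations, auto_repair and repair_prompt are hypothetical names, only two of the checks are shown (stop count and strenuous items), and the repair is deliberately crude in that it drops offending items and trims from the end of the day.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    strenuous: bool = False


def find_violations(day: list[Activity], max_activities: int,
                    allow_strenuous: bool) -> list[str]:
    """Return human-readable rule violations; an empty list means the day passes."""
    problems = []
    if len(day) > max_activities:
        problems.append(f"too many stops: {len(day)} > {max_activities}")
    if not allow_strenuous:
        for activity in day:
            if activity.strenuous:
                problems.append(f"strenuous item on a low-energy day: {activity.name}")
    return problems


def auto_repair(day: list[Activity], max_activities: int,
                allow_strenuous: bool) -> list[Activity]:
    """Crude repair: drop disallowed strenuous items, then trim to the cap."""
    kept = [a for a in day if allow_strenuous or not a.strenuous]
    return kept[:max_activities]


def repair_prompt(violations: list[str]) -> str:
    """Follow-up message that names each violation when sending the plan back."""
    bullets = "\n".join(f"- {v}" for v in violations)
    return "The plan breaks these rules. Fix only these and change nothing else:\n" + bullets


generated_day = [
    Activity("Colosseum"),
    Activity("Hill hike", strenuous=True),
    Activity("Trevi Fountain"),
    Activity("Gelato stop"),
    Activity("Pantheon"),
]
violations = find_violations(generated_day, max_activities=3, allow_strenuous=False)
repaired = auto_repair(generated_day, max_activities=3, allow_strenuous=False)
print(violations)
print([a.name for a in repaired])  # ['Colosseum', 'Trevi Fountain', 'Gelato stop']
```

Whichever path is taken, the repaired or regenerated day goes through find_violations again before it is shown to anyone.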

This before/after sandwich is the same shape as the guardrail work in Part 2 of our other build log — and that’s not a coincidence. Whenever an LLM output has real-world consequences, you constrain the input and validate the output. The middle is the easy part.

The “avoid these” list did more than any clever prompt

The single highest-value feature wasn’t a generation trick — it was a “common mistakes to avoid” block baked into every plan: don’t over-schedule, build in buffer time, group by area, check opening hours, have a rainy-day backup. Users told us this was the part that felt like advice from someone who’d actually traveled with kids, not a brochure. It costs almost nothing to add and it’s the thing people screenshot.

Lesson: the value isn’t the itinerary, it’s the judgment around it. The raw list is a commodity any chatbot produces. The constraints, the pacing sanity, and the “here’s what trips people up” — that’s the product.

Token discipline still applies

Same lesson as the rest of our build logs: the generation is cheap, so spend the savings on structure.

Compute constraints in code, not in tokens. Don’t make the model derive “toddler ⇒ shorter days.” Derive it yourself and state it.
Generate the skeleton once, enrich on demand. When a user tweaks one day, we regenerate only that day rather than the whole trip. That is cheaper, and it keeps the rest of the plan stable.
Cap output length and structure it. A fixed per-day shape (morning / midday rest / afternoon / evening + a tip) is scannable for users and validatable by our checker.
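As a rough sketch of the last two points, a fixed per-day shape can be enforced with a small parser, and a one-day tweak can swap in a single validated day while leaving the rest of the trip untouched. The assumptions here are ours for illustration: that the model is asked to return each day as a JSON object, the slot names in DAY_SLOTS, the 200-character cap, and the helper names parse_day and replace_day.

```python
import json

# Hypothetical slot names for the morning / midday rest / afternoon / evening + tip shape.
DAY_SLOTS = ("morning", "midday_rest", "afternoon", "evening", "tip")
MAX_CHARS_PER_SLOT = 200  # placeholder cap on output length


def parse_day(raw: str) -> dict[str, str]:
    """Parse one model-generated day and reject anything off-shape or too long."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("day must be a JSON object")
    missing = [slot for slot in DAY_SLOTS if slot not in data]
    extra = [key for key in data if key not in DAY_SLOTS]
    if missing or extra:
        raise ValueError(f"bad day shape: missing={missing}, extra={extra}")
    for slot in DAY_SLOTS:
        value = data[slot]
        if not isinstance(value, str) or len(value) > MAX_CHARS_PER_SLOT:
            raise ValueError(f"slot {slot!r} must be a string of at most "
                             f"{MAX_CHARS_PER_SLOT} characters")
    return {slot: data[slot] for slot in DAY_SLOTS}


def replace_day(trip: list[dict[str, str]], index: int, raw_day: str) -> list[dict[str, str]]:
    """Return a copy of the trip with one day swapped; other days are reused as-is."""
    updated = list(trip)
    updated[index] = parse_day(raw_day)
    return updated


day_one = parse_day(json.dumps({
    "morning": "Colosseum, early entry",
    "midday_rest": "Back to the hotel for a nap",
    "afternoon": "Playground in a nearby park",
    "evening": "Early dinner close to the hotel",
    "tip": "Book tickets ahead and bring snacks",
}))
trip = [day_one, dict(day_one)]
new_day_two = json.dumps({**day_one, "afternoon": "Gelato and a short walk"})
trip_after_tweak = replace_day(trip, 1, new_day_two)
print(trip_after_tweak[1]["afternoon"])
```

Because every day has the same keys, the checker never has to guess where the rest block or the tip lives, and a malformed response fails loudly instead of reaching the user.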

If you’re building anything that generates a plan

Itinerary, meal plan, study schedule, workout program — they’re all the same problem wearing different clothes. The LLM makes the prose effortless and lulls you into thinking you’re done. You’re not. The product is the constraint layer: encode the human limits before you generate, validate against them after, and surface the “what people get wrong” judgment that a generic model never volunteers.

The model is the cheapest, most replaceable part of the whole thing. The constraints are yours, and they’re the moat.

More from our build logs: the cheap stack & real costs, the prompt + guardrail layer, and the $0 Flask + SQLite backend.