The $0 Backend: Flask + SQLite Rate-Limiting and Audit Logs for an AI App

Third in our build-log series, after the stack & costs and the prompt + guardrail layer. Same rule as before: this is first-hand — it’s how our app actually runs today, not a tutorial copied from somewhere.

Everyone wiring an LLM into a product hits the same fork: the model call is trivial, but the moment real users arrive you need rate-limiting (so one person can’t drain your token budget), quotas (so “free” stays affordable), and audit logs (so you can see what happened when something breaks). The internet will tell you to reach for Redis and a managed Postgres. We didn’t. Here’s the boring stack that has never let us down — and the reasoning, so you can decide if it fits you.

The whole backend, in one breath

Flask + gunicorn, one process, behind nginx.
One SQLite file (in WAL mode) for quotas, rate-limit counters, and audit logs.
No accounts required — we identify anonymous users by a lightweight fingerprint + IP, not a login.
systemd keeps it alive; Let’s Encrypt gives free TLS.
No Redis. No managed database. No Docker.

That’s it. It serves real traffic and the monthly bill is a rounding error. Now the why.

Why SQLite instead of Redis + Postgres

The honest answer: at our scale, a single file is faster to reason about and impossible to misconfigure. Redis and a managed DB are two more services to secure, back up, pay for, and debug at 2am.

The thing people get wrong about SQLite is assuming it can’t handle concurrency. With WAL (Write-Ahead Logging) mode turned on, readers don’t block writers and writers don’t block readers. For a workload that’s mostly small reads and occasional small writes — which is exactly what quota/rate-limit/audit logging is — it’s plenty.

The rule we use: SQLite until a single box genuinely can’t keep up. We’re nowhere near that ceiling, and most solo apps never reach it. If you do, that’s the day you add a real DB — not before.

Rate-limiting without Redis

The textbook reason for Redis is atomic counters. You can get the same guarantee from SQLite if you let the database do the work instead of your Python code:

Do the increment-and-check inside a single transaction, so two simultaneous requests can’t both “win” the last allowed call. The database serializes it for you.
Use a fixed or sliding window keyed by (fingerprint, day). A row per user per day, incremented per request, compared against the limit. Simple to read, trivial to debug.
Fail closed on the budget, open on errors. If the limit is hit, refuse politely. But if the limiter itself errors, we let the request through and log it — better a rare over-serve than blocking a real user because of our bug.

No background service, no extra network hop. The counter lives in the same file as everything else.

Quotas that keep “free” actually free

Our principle (it predates this site): the free tier must be fully usable on its own, and paid is an enhancement, never a crippled free. But “free” calling a paid model API still costs us real tokens — so quotas exist to keep the public-good promise sustainable, not to nag people into upgrading.

What that looks like in the schema:

A tier field on the user/fingerprint record from day one. Add this even if everything is free today — retrofitting a plan column after launch is miserable.
Per-tier daily limits, read from config, not hard-coded in handlers.
A clear, friendly message when a limit is reached — what the limit is, when it resets — never a dead end.

Audit logs: the cheapest insurance you’ll ever buy

This is the part solo builders skip and later regret. Every AI call writes a small audit row: a timestamp, the anonymous fingerprint, which endpoint, token counts, and whether a safety rule fired (remember the red-flag handling from Part 2). We deliberately do not store the user’s raw sensitive content — only what we need to debug and to prove the system behaved.

Why it’s worth the few lines of code:

Debugging: when someone reports “it gave me a weird answer,” you can actually see the shape of what happened.
Cost forensics: a sudden token spike has a paper trail — which endpoint, which pattern of use.
Safety posture: being able to show that urgency cues fired and disclaimers were attached is the difference between a defensible tool and a hopeful one.

Logs are append-only, rotate by date, and a tiny scheduled job prunes old rows so the file never bloats.

The one upgrade we did make: a separate scheduler

The single place we split things out: long-running and timed jobs run in their own process, not inside the web workers. Data refreshes, log pruning, anything periodic — a separate systemd-managed scheduler. This keeps the web tier stateless and restart-safe: we can redeploy the app without interrupting a background job, and a stuck job can’t take request handling down with it. That separation, not more infrastructure, is what bought us reliability.

Would we change it? No — and here’s the honest caveat

If you’re serving millions of writes a second, none of this applies — go get your distributed datastore. But if you’re a solo builder or a small team shipping an AI tool, the temptation to pre-build for a scale you don’t have is the actual enemy. Every service you add is something that can page you at night.

We run a real AI app on one Flask process and one SQLite file, and the limiting factor has never once been the database. The boring stack didn’t just save money — it saved attention, which for a solo maker is the scarcer resource.

That closes the loop on our build log: cheap by the token, safe by the prompt, and reliable by refusing complexity. If there’s a part of the stack you want us to go deeper on, tell us — we’ll write the one people ask for most.