A Redis Job Queue Was Enough
When we started running embeddings, chatbot replies, certificate generation, and translation jobs out of the request path, the obvious choice — the one every architecture deck pushes — was Kafka. Durable log, replay, consumer groups, the whole story.
We didn’t pick it.
Numbers, not vibes
We measured. Peak load across all background work was roughly 40 jobs per second, with a long tail in job duration (LLM calls dominate the time). The average backlog at peak was under 500 jobs. Losing the occasional job is acceptable: if a certificate render fails, we re-queue it on the next request.
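A rough sketch of how those numbers translate into worker capacity, using Little's law (jobs in flight = arrival rate × time per job). The 40 jobs per second and 500-job backlog are the measurements above; the 2-second average job duration and the 100 worker slots are assumptions made up for illustration, not measured values.

```typescript
// Back-of-envelope sizing with Little's law: inFlight = arrivalRate * serviceTime.
// avgJobSeconds and workerSlots are ASSUMED inputs for illustration only.

interface Sizing {
  busyWorkers: number;  // average number of jobs being processed at peak
  drainSeconds: number; // time to clear the backlog using only spare capacity
}

function sizeWorkers(
  peakJobsPerSecond: number,
  avgJobSeconds: number,
  peakBacklog: number,
  workerSlots: number,
): Sizing {
  // Slots kept busy just by keeping up with arrivals.
  const busyWorkers = peakJobsPerSecond * avgJobSeconds;
  // Whatever is left over can work through the backlog.
  const spareSlots = workerSlots - busyWorkers;
  const spareThroughput = spareSlots / avgJobSeconds; // jobs per second
  const drainSeconds =
    spareThroughput > 0 ? peakBacklog / spareThroughput : Infinity;
  return { busyWorkers, drainSeconds };
}

// 40 jobs/s and a 500-job backlog, with an assumed 2 s per job and 100 slots.
const sizing = sizeWorkers(40, 2, 500, 100);
console.log(sizing);
```

Under those assumptions the fleet needs about 80 busy slots to keep up, and the remaining 20 clear the backlog in under a minute. With no spare slots the function returns `Infinity`, which is the "queue depth that doesn't drain" case.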
For that workload, Kafka is theatre. We don’t need replay. We don’t need to fan out to multiple consumer groups. We don’t need exactly-once semantics — at-least-once is correct, and our jobs are idempotent.
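At-least-once delivery is only safe when running a job twice has the same effect as running it once. A minimal sketch of that idea follows, with an in-memory map standing in for a durable store. The `CertificateJob` shape, the key format, and the URL are hypothetical and not taken from our codebase.

```typescript
// Idempotent handler sketch: a duplicate delivery returns the stored result
// instead of repeating the work. All names here are hypothetical.

interface CertificateJob {
  userId: string;
  courseId: string;
}

// Stand-in for a durable store (in practice a database row or a Redis key).
const completedResults = new Map<string, string>();
let renderCount = 0;

// The key is derived from the payload, so a retried or duplicated job
// always maps to the same entry.
function idempotencyKey(job: CertificateJob): string {
  return `certificate:${job.userId}:${job.courseId}`;
}

function handleCertificateJob(job: CertificateJob): string {
  const key = idempotencyKey(job);
  const existing = completedResults.get(key);
  if (existing !== undefined) {
    return existing; // duplicate delivery: skip the expensive render
  }
  renderCount += 1; // stands in for the actual render
  const url = `/certificates/${job.userId}/${job.courseId}.pdf`;
  completedResults.set(key, url);
  return url;
}

// Deliver the same job twice, as an at-least-once queue may do.
const certificateJob: CertificateJob = { userId: "u1", courseId: "c9" };
const firstResult = handleCertificateJob(certificateJob);
const secondResult = handleCertificateJob(certificateJob);
console.log(firstResult === secondResult, renderCount);
```

The second delivery returns the same URL and the render runs once. A real handler would also have to deal with two workers racing on the same key, which this sketch ignores.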
What we do need:
- A persistent queue per job type.
- Visible retries with backoff.
- A dashboard so on-call can see what’s stuck.
- One fewer service to operate.
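For the retry requirement, here is a sketch of a capped exponential backoff schedule. The 1-second base and 60-second cap are illustrative assumptions. The `retryOptions` object mirrors the `attempts` and `backoff` fields that BullMQ accepts in its job options, but it is a plain object here and is not passed to the library, so check it against the version you run.

```typescript
// Capped exponential backoff: delay = base * 2^(attempt - 1), up to maxMs.
// baseMs and maxMs defaults are ASSUMED values for illustration.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  if (attempt < 1) {
    throw new RangeError("attempt is 1-based");
  }
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// Delay before each of the first five retries.
const retrySchedule = [1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt));
console.log(retrySchedule);

// Shape of the equivalent per-job options (plain object, not sent anywhere).
const retryOptions = {
  attempts: 5,
  backoff: { type: "exponential", delay: 1000 },
};
console.log(retryOptions);
```

The cap matters for long-running incidents: without it the tenth retry would wait more than eight minutes.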
What Redis + BullMQ gave us
- We were already running Redis for session storage and rate limiting, so the operational footprint stayed the same.
- BullMQ has ready-made dashboards (the open-source Bull Board, for example), so on-call gets a UI without us building one.
- Job state is queryable with normal Redis commands. Debugging is KEYS and HGETALL, not consumer-group offsets.
- Latency from enqueue to worker pickup is in the 1–5 ms range. That matters for jobs the user is waiting on.
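To make the debugging point concrete, here is a small helper that prints the redis-cli commands we would type for a given queue and job. It assumes BullMQ's default `bull` key prefix and a `<prefix>:<queue>:<suffix>` key layout, which is an internal detail that can differ across versions or custom prefixes. The queue name and job ID are hypothetical. It uses SCAN, the incremental counterpart to KEYS, because KEYS walks the whole keyspace in one blocking call.

```typescript
// Builds redis-cli commands for inspecting a queue by hand.
// ASSUMPTION: default "bull" prefix and "<prefix>:<queue>:<suffix>" key layout.
function queueKey(queue: string, suffix: string, prefix = "bull"): string {
  return [prefix, queue, suffix].join(":");
}

function debugCommands(queue: string, jobId: string): string[] {
  return [
    // List keys belonging to the queue without blocking Redis.
    `SCAN 0 MATCH ${queueKey(queue, "*")} COUNT 100`,
    // Number of jobs waiting to be picked up (assumed to be a list).
    `LLEN ${queueKey(queue, "wait")}`,
    // Number of failed jobs (assumed to be a sorted set).
    `ZCARD ${queueKey(queue, "failed")}`,
    // All stored fields for one job (assumed to be a hash).
    `HGETALL ${queueKey(queue, jobId)}`,
  ];
}

// Hypothetical queue name and job id.
const inspectCommands = debugCommands("certificates", "42");
console.log(inspectCommands.join("\n"));
```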
What we’d switch for
If we ever hit:
- A queue depth that doesn’t drain — meaning we need multiple worker pools competing for the same stream.
- A need to replay weeks of events to rebuild a downstream system.
- Cross-team consumers reading the same event.
Then Kafka (or NATS JetStream, which we already run) earns its weight. Until then, the simplest thing that handles the load is the right thing.
The lesson
The capacity gap between “what your stack can handle” and “what real companies need” is wider than most architecture posts admit. Count jobs per second before picking infrastructure.