state.stage = "production-ready"
I design AI systems that don't break.
The STATE framework is how.
Your GenAI system is not failing because of the model.
It is failing because of the plumbing around it.
Same input. Different outputs. No reproducible bugs.
Post-mortems end with "the model did something weird." No root cause. No stack trace.
No traces. No per-user state. Flying blind.
Problems go unnoticed until users complain. Logging prompts and hoping is not observability.
"Can we log why the agent did this?" You cannot answer that.
Risk and legal need documentation. Law 25 requires it. Your system cannot answer in 30 minutes.
“Our GenAI stuff is basically a clever prototype duct-taped into production — non-deterministic, we cannot reproduce failures, and risk is breathing down our neck. I need a proper architecture for stateful, observable, auditable LLM systems so I stop betting my job on vibes.”
— LLMOps Lead, Financial Services, Quebec
State Beats Intelligence.
A mid-tier model with proper state management beats a frontier model running stateless — every time.
Production-ready threshold
8–10
out of 10 STATE score
Structured
Explicit state schemas, not implicit context
Every operation initializes a typed state object. The stage field always reflects current execution position. If your agent crashed right now, could you look at the last saved state and know exactly where it stopped?
Traceable
Every step observable, every decision logged
Log every LLM call, external API call, and meaningful stage transition. You must be able to reconstruct exactly what the agent did, what it was given, and what it produced — for any execution, after the fact.
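One lightweight way to get there is a tracing decorator over every step, sketched here with Python's standard logging; the step names and JSON fields are assumptions, not a prescribed schema:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("trace")

def traced(step_name):
    """Log input, output, and duration for one step, keyed by a shared event id."""
    def wrap(fn):
        def inner(*args, **kwargs):
            event_id = str(uuid.uuid4())
            start = time.monotonic()
            log.info(json.dumps({"event": event_id, "step": step_name,
                                 "phase": "start", "input": repr(args)}))
            result = fn(*args, **kwargs)
            log.info(json.dumps({"event": event_id, "step": step_name,
                                 "phase": "end", "output": repr(result),
                                 "ms": round((time.monotonic() - start) * 1000)}))
            return result
        return inner
    return wrap

@traced("llm_call")
def call_llm(prompt):
    return "stubbed model output"  # stand-in for the real provider API call
```

Structured JSON lines like these are what let you reconstruct an execution after the fact, whether they land in files, OpenTelemetry, or a log aggregator.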
Auditable
Governance-ready, explainable under Law 25
For any automated decision affecting an individual, write a decision record. Quebec Law 25 requires it. So does OSFI. "Can we log why the agent did this?" must have a 30-minute answer.
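A decision record can be one structured, append-only object per automated decision. A minimal sketch; the field names are illustrative, and the specific fields your legal team needs may differ:

```python
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    decision_id: str
    subject_id: str       # the individual the decision affects
    decision: str         # what was decided
    inputs_summary: str   # what the model was given
    model_version: str    # which model/prompt version produced it
    rationale: str        # why, in plain language
    timestamp: str

def record_decision(subject_id, decision, inputs_summary, model_version, rationale):
    """Build one serialized decision record, ready for an append-only audit store."""
    rec = DecisionRecord(
        decision_id=str(uuid.uuid4()),
        subject_id=subject_id,
        decision=decision,
        inputs_summary=inputs_summary,
        model_version=model_version,
        rationale=rationale,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(rec))
```

With records like this written at decision time, the 30-minute answer is a query, not an archaeology project.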
Tolerant
Fault-tolerant and resumable after failure
When the workflow fails at step 6, it resumes from step 6 — not step 1. Lock before expensive operations. Clear lock on failure. If it only works moving forward, it's a demo.
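Resume-from-failure can be sketched with a checkpoint that stores the next step and a lock flag; the step names and file-based persistence here are illustrative stand-ins for a real state store:

```python
import json
import os

CHECKPOINT = "workflow_state.json"
STEPS = ["fetch", "chunk", "embed", "retrieve", "generate", "validate", "write"]

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_step": 0, "locked": False}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def run(handlers):
    """Execute steps from the last checkpoint. On failure, the checkpoint
    still points at the failed step, so a retry resumes there, not at step 1."""
    state = load_state()
    for i in range(state["next_step"], len(STEPS)):
        state["locked"] = True          # lock before the expensive operation
        save_state(state)
        try:
            handlers[STEPS[i]]()
        except Exception:
            state["locked"] = False     # clear lock so a retry can resume here
            save_state(state)
            raise
        state["next_step"] = i + 1
        state["locked"] = False
        save_state(state)
```

Running it twice, with the first run failing partway, is a quick self-test of the pillar: the second run should skip every completed step.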
Explicit
Deterministic boundaries, no magic
Every LLM output passes through a validation gate before any write or action. Invalid output routes to the error path — never silently continues. The seam between reasoning and action is always named.
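The gate can be a single named function sitting between model output and any write. A minimal sketch, assuming the model is asked for JSON; the required keys are illustrative:

```python
import json

def validation_gate(raw_output, required_keys):
    """The named seam between reasoning and action.
    Returns (parsed, None) when safe to act, (None, error) otherwise."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    missing = [k for k in required_keys if k not in parsed]
    if missing:
        return None, f"missing keys: {missing}"
    return parsed, None

parsed, err = validation_gate('{"action": "refund", "amount": 25}', ["action", "amount"])
# err is None: safe to act. Otherwise route to the error path; never continue silently.
```

The point is not the JSON check itself but that every write goes through one explicit, testable gate instead of trusting raw model output.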
You've been handed a GenAI platform.
Now you're accountable for reliability.
7–15 years in backend, data engineering, or SRE. Got pulled into GenAI platform ownership 1–2 years ago with an ambiguous mandate. Not an ML researcher. Came up through systems, not models.
Non-determinism in production
Same input, different outputs. Bugs cannot be reproduced.
No observability
No traces. No per-user state. Flying blind until users complain.
The compliance gap
Risk asks "can we log why?" Law 25 requires the answer.
Leadership pressure
100% feel pressure to ship GenAI. 90% say expectations are unrealistic.
Practitioner, not guru.
I came up through backend and systems engineering, got pulled into GenAI platform work, and spent too long debugging failures that had nothing to do with the model.
The STATE framework is how I stopped guessing and started shipping reliably. It's not a research paper — it's what I use on real systems in regulated environments.
Montreal-based. Background in C#/.NET and distributed systems. Bilingual. Teaching what I learned the hard way.
Start with the checklist.
Build from there.
Three entry points, one destination: a GenAI system that does not break under production conditions.
STATE Readiness Checklist
Score your GenAI system against 5 production-readiness pillars. 5 minutes. Concrete gaps, not vague advice. Know exactly where your system will fail before it does.
- 5-pillar self-assessment
- Scoring rubric with interpretation
- The Reboot Test included
No Stack Trace
How to Make Agent Failures Reproducible
90 minutes. Live teardown of a real RAG architecture scored against the STATE framework. You'll leave with a reproducible debugging methodology, not just theory.
- Live architecture teardown
- STATE scoring exercise
- Reproducibility methodology
LLMOps Cohort
4 Weeks. Your System. Real Fixes.
A small cohort (10–12 engineers) working through the STATE framework on their actual production systems. You bring the system; we fix what's broken.
- 4 weekly live sessions
- Work on your real system
- Law 25 compliance module
Free · 5 minutes
Is your GenAI pilot
production-ready?
Score it against the STATE framework. Concrete gaps, not vague advice. Know exactly where your system will fail before your users do.