AI Reliability Engineering
state.stage = "production-ready"

I design AI systems
that don't break.

The STATE framework is how.

02 / The Problem

Your GenAI system is not failing because of the model.

It is failing because of the plumbing around it.

01

Same input. Different outputs. No reproducible bugs.

Post-mortems end with "the model did something weird." No root cause. No stack trace.

02

No traces. No per-user state. Flying blind.

Problems go unnoticed until users complain. Logging prompts and hoping is not observability.

03

"Can we log why the agent did this?" You cannot answer that.

Risk and legal need documentation. Law 25 requires it. Your system cannot answer in 30 minutes.

“Our GenAI stuff is basically a clever prototype duct-taped into production — non-deterministic, we cannot reproduce failures, and risk is breathing down our neck. I need a proper architecture for stateful, observable, auditable LLM systems so I stop betting my job on vibes.”

— LLMOps Lead, Financial Services, Quebec
03 / The STATE Framework

State Beats Intelligence.

A mid-tier model with proper state management beats a frontier model running stateless — every time.

Production-ready threshold

8–10

out of 10 STATE score

Structured

Explicit state schemas, not implicit context

Every operation initializes a typed state object. The stage field always reflects current execution position. If your agent crashed right now, could you look at the last saved state and know exactly where it stopped?
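To make the pillar concrete, here is a minimal sketch of a typed state object in Python. The field names (`run_id`, `inputs`, `outputs`) are illustrative, not part of the framework:

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class AgentState:
    """Typed state object, persisted after every stage transition."""
    run_id: str
    stage: str = "initialized"      # always reflects current execution position
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

    def checkpoint(self, path: str) -> None:
        """Persist state so a crash leaves behind exactly where we stopped."""
        with open(path, "w") as f:
            json.dump(asdict(self), f)


state = AgentState(run_id="run-001", inputs={"query": "refund status"})
state.stage = "retrieval"           # update before each stage, checkpoint after
```

If the process dies here, the last checkpoint says `"stage": "retrieval"`: you know exactly where it stopped.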

Traceable

Every step observable, every decision logged

Log every LLM call, external API call, and meaningful stage transition. You must be able to reconstruct exactly what the agent did, what it was given, and what it produced — for any execution, after the fact.
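One way to sketch that, assuming your LLM client is wrapped in a plain callable (`call_fn` is a stand-in, not a real SDK):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent.trace")


def traced_llm_call(prompt: str, call_fn) -> str:
    """Wrap any LLM call so input, output, and timing are reconstructable later."""
    call_id = str(uuid.uuid4())
    start = time.monotonic()
    output = call_fn(prompt)
    # One structured log line per call: enough to replay what the agent
    # was given and what it produced, for any execution, after the fact.
    log.info(json.dumps({
        "event": "llm_call",
        "call_id": call_id,
        "prompt": prompt,
        "output": output,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
    }))
    return output
```

The same wrapper pattern applies to external API calls and stage transitions: one structured event per step, keyed by a call or run ID.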

Auditable

Governance-ready, explainable under Law 25

For any automated decision affecting an individual, write a decision record. Quebec Law 25 requires it. So does OSFI. "Can we log why the agent did this?" must have a 30-minute answer.
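A decision record can be as simple as an append-only, immutable row. This is a sketch of the shape, not a compliance template; the fields are assumptions about what your auditors will ask for:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class DecisionRecord:
    """One append-only record per automated decision affecting an individual."""
    subject_id: str
    decision: str
    model_version: str
    inputs_digest: str      # hash of the inputs, not raw personal data
    rationale: str          # the explanation you hand to risk and legal
    timestamp: str


def record_decision(store: list, **fields) -> DecisionRecord:
    rec = DecisionRecord(
        timestamp=datetime.now(timezone.utc).isoformat(), **fields
    )
    store.append(json.dumps(asdict(rec)))   # append-only log, never mutated
    return rec
```

With records like this in place, "why did the agent do this?" becomes a query, not an investigation.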

Tolerant

Fault-tolerant and resumable after failure

When the workflow fails at step 6, it resumes from step 6 — not step 1. Lock before expensive operations. Clear lock on failure. If it only works moving forward, it's a demo.
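A minimal resumable loop, checkpointing to a local JSON file (a stand-in for whatever durable store you actually use):

```python
import json
import os

CHECKPOINT = "workflow.ckpt.json"


def _save(state: dict) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)


def run_workflow(steps) -> dict:
    """Run steps with a checkpoint after each; on restart, resume where we stopped."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)            # resume from the last checkpoint
    else:
        state = {"next_step": 0, "locked": False}
    for i in range(state["next_step"], len(steps)):
        state["locked"] = True              # lock before the expensive operation
        _save(state)
        try:
            steps[i](state)
        except Exception:
            state["locked"] = False         # clear the lock on failure...
            _save(state)                    # ...but keep next_step: resume at step i, not step 1
            raise
        state["next_step"] = i + 1          # advance only after the step succeeds
        state["locked"] = False
        _save(state)
    return state
```

Kill the process at step 6, restart it, and the loop picks up at step 6. That is the difference between a system and a demo.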

Explicit

Deterministic boundaries, no magic

Every LLM output passes through a validation gate before any write or action. Invalid output routes to the error path — never silently continues. The seam between reasoning and action is always named.
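A bare-bones version of that gate, assuming the agent is expected to emit a JSON object (the required keys here are made up for illustration):

```python
import json


class ValidationError(Exception):
    pass


def validation_gate(raw_llm_output: str, required_keys: set) -> dict:
    """The named seam between reasoning and action: parse-or-reject, never guess."""
    try:
        parsed = json.loads(raw_llm_output)
    except json.JSONDecodeError as e:
        raise ValidationError(f"non-JSON output: {e}") from e
    if not isinstance(parsed, dict):
        raise ValidationError("output is not a JSON object")
    missing = required_keys - parsed.keys()
    if missing:
        raise ValidationError(f"missing keys: {missing}")
    return parsed   # only validated output ever reaches a write or an action


def handle(raw: str):
    try:
        action = validation_gate(raw, {"action", "target"})
    except ValidationError as err:
        return ("error_path", str(err))   # route to the error path, never continue silently
    return ("execute", action)
```

Invalid output gets a named destination instead of quietly flowing downstream; that seam is where determinism re-enters the system.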

Score Your System
Medium risk minimum: S + T + E required for all pipeline commands
04 / Who This Is For

You've been handed a GenAI platform.
Now you're accountable for reliability.

7–15 years in backend, data engineering, or SRE. Got pulled into GenAI platform ownership 1–2 years ago with an ambiguous mandate. Not an ML researcher. Came up through systems, not models.

LLMOps Engineer
GenAI Platform Advisor
Senior ML Engineer, LLM Infra
Senior Architect, GenAI Platform
AI Platform Lead
Staff Software Engineer (AI Platform)

Non-determinism in production

Same input, different outputs. Bugs cannot be reproduced.

No observability

No traces. No per-user state. Flying blind until users complain.

The compliance gap

Risk asks "can we log why?" Law 25 requires the answer.

Leadership pressure

100% feel pressure to ship GenAI. 90% say expectations are unrealistic.

05 / About

Practitioner, not guru.

I came up through backend and systems engineering, got pulled into GenAI platform work, and spent too long debugging failures that had nothing to do with the model.

The STATE framework is how I stopped guessing and started shipping reliably. It's not a research paper — it's what I use on real systems in regulated environments.

Montreal-based. Background in C#/.NET and distributed systems. Bilingual. Teaching what I learned the hard way.

Category
AI Reliability Engineering
Stack
C#/.NET, Python, TypeScript
Focus
Stateful, observable, auditable LLM systems
Location
Montreal, QC, bilingual
Framework
STATE (5-pillar production readiness)
Regulatory scope
Law 25, OSFI, EU AI Act
06 / How We Work Together

Start with the checklist.
Build from there.

Three entry points, one destination: a GenAI system that does not break under production conditions.

01 · Free

STATE Readiness Checklist

Score your GenAI system against 5 production-readiness pillars. 5 minutes. Concrete gaps, not vague advice. Know exactly where your system will fail before it does.

  • 5-pillar self-assessment
  • Scoring rubric with interpretation
  • The Reboot Test included
Download Free
02 · Free · April 2026

No Stack Trace

How to Make Agent Failures Reproducible

90 minutes. Live teardown of a real RAG architecture scored against the STATE framework. You'll leave with a reproducible debugging methodology, not just theory.

  • Live architecture teardown
  • STATE scoring exercise
  • Reproducibility methodology
Register
03 · Paid · May 2026

LLMOps Cohort

4 Weeks. Your System. Real Fixes.

A small cohort (10–12 engineers) working through the STATE framework on their actual production systems. You bring the system; we fix what's broken.

  • 4 weekly live sessions
  • Work on your real system
  • Law 25 compliance module
Join Waitlist

Free · 5 minutes

Is your GenAI pilot
production-ready?

Score it against the STATE framework. Concrete gaps, not vague advice. Know exactly where your system will fail before your users do.