Designing Automation That Can Safely Fail
Failure Is Not Optional
All production automation eventually encounters:
- Device timeouts
- Partial command execution
- Upstream dependency outages
- Unexpected platform behaviour
The core question is not whether failure occurs. It is whether your workflow fails in a safe direction.
Fail-Closed vs Fail-Open
Use policy-driven choices:
- Fail-closed for change correctness, identity checks, and security controls
- Fail-open only for non-critical observability or advisory outputs
If uncertain, default to fail-closed for write paths.
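One way to make this policy explicit in code is a per-phase lookup that defaults to fail-closed. The sketch below is a minimal illustration, not a prescribed implementation; the phase names (`identity_check`, `apply_change`, `emit_metrics`) and the helpers `policy_for` and `run_phase` are hypothetical.

```python
from enum import Enum
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class FailurePolicy(Enum):
    FAIL_CLOSED = "fail_closed"  # any error stops the workflow
    FAIL_OPEN = "fail_open"      # error is recorded, workflow continues


# Hypothetical phase names, used only for illustration.
PHASE_POLICY = {
    "identity_check": FailurePolicy.FAIL_CLOSED,
    "apply_change": FailurePolicy.FAIL_CLOSED,
    "emit_metrics": FailurePolicy.FAIL_OPEN,
}


def policy_for(phase: str) -> FailurePolicy:
    # Phases without an explicit policy are treated as fail-closed.
    return PHASE_POLICY.get(phase, FailurePolicy.FAIL_CLOSED)


def run_phase(phase: str, step: Callable[[], T], warnings: List[str]) -> Optional[T]:
    """Run one step and apply the failure policy of its phase."""
    try:
        return step()
    except Exception as exc:
        if policy_for(phase) is FailurePolicy.FAIL_CLOSED:
            raise  # never swallow errors on fail-closed phases
        # Fail-open: keep the error visible, then continue.
        warnings.append(f"{phase}: {exc}")
        return None
```

The broad `except` is acceptable here only because fail-closed phases re-raise immediately and fail-open phases still record the error.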
Safe Abort Conditions
Define clear hard-stop triggers:
- Identity mismatch
- Privilege mismatch
- Pre-flight critical failure
- Ambiguous parser output in a required check
- Error-rate threshold exceeded in current batch
Hard stops should be deterministic and tested.
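The triggers above can be codified as a single deterministic check that runs before each write. This is a sketch under stated assumptions: the `BatchState` fields, the `HardStop` exception, and the 5% value of `MAX_ERROR_RATE` are all hypothetical and would come from your own workflow and risk model.

```python
from dataclasses import dataclass


class HardStop(Exception):
    """Raised when the run must abort; callers should not suppress it."""


@dataclass
class BatchState:
    # Hypothetical fields describing the current device and batch.
    expected_device_id: str
    observed_device_id: str
    has_required_privilege: bool
    preflight_ok: bool
    parser_ambiguous: bool
    attempted: int
    failed: int


MAX_ERROR_RATE = 0.05  # assumed threshold, for illustration only


def check_hard_stops(state: BatchState) -> None:
    """Raise HardStop on the first trigger that fires; return None otherwise."""
    if state.observed_device_id != state.expected_device_id:
        raise HardStop("identity mismatch")
    if not state.has_required_privilege:
        raise HardStop("privilege mismatch")
    if not state.preflight_ok:
        raise HardStop("pre-flight critical failure")
    if state.parser_ambiguous:
        raise HardStop("parser ambiguity in required check")
    if state.attempted and state.failed / state.attempted > MAX_ERROR_RATE:
        raise HardStop("error-rate threshold exceeded")
```

Because the function depends only on its input, each trigger can be covered by a plain unit test with no device access.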
Degradation Pattern
When non-critical components fail:
- Continue only if the risk model allows it
- Mark run state as degraded
- Increase logging and operator visibility
- Block promotion to broader scope
Graceful degradation is useful only when safety invariants still hold.
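A minimal sketch of this pattern follows. It assumes the caller has already evaluated the risk model and the safety invariants and passes the results in as booleans; `RunState`, `SafetyInvariantError`, `handle_noncritical_failure`, `may_promote`, and the logger name are hypothetical.

```python
import logging
from dataclasses import dataclass, field
from typing import List

logger = logging.getLogger("automation.run")  # hypothetical logger name


class SafetyInvariantError(Exception):
    """Raised when continuing would not be safe."""


@dataclass
class RunState:
    degraded: bool = False
    reasons: List[str] = field(default_factory=list)


def handle_noncritical_failure(
    state: RunState, component: str, risk_allows: bool, invariants_hold: bool
) -> None:
    """Either stop the run or continue in an explicitly degraded state."""
    if not (risk_allows and invariants_hold):
        raise SafetyInvariantError(f"cannot continue after {component} failure")
    state.degraded = True
    state.reasons.append(component)
    logger.setLevel(logging.DEBUG)  # increase logging for the rest of the run
    logger.warning("run degraded: %s failed", component)


def may_promote(state: RunState) -> bool:
    # A degraded run is never promoted to a broader scope.
    return not state.degraded
```

Keeping the degraded flag on the run state, rather than only in logs, lets the promotion gate check it directly.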
Production Checklist
- Failure policy is explicit per workflow phase
- Hard-stop triggers are codified and versioned
- Error-rate thresholds are enforced at runtime
- Degraded mode behaviour is documented and observable
- Human escalation paths are clear and tested
Anti-Patterns
- Catch-all exception handling that suppresses critical faults
- Continuing writes after parser uncertainty
- Defining failure policy only in runbooks, not code
- Treating failed post-validation as a minor warning
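As a contrast to the first and last items in this list, the sketch below handles only one expected fault type and treats a failed post-validation as an error. The names `apply_with_validation`, `PostValidationError`, and the callables passed in are hypothetical; `TimeoutError` is the Python built-in, used here as a stand-in for a device timeout.

```python
from typing import Callable, List


class PostValidationError(Exception):
    """A failed post-validation is a fault, not a warning."""


def apply_with_validation(
    apply_change: Callable[[], None],
    post_validate: Callable[[], bool],
    log: List[str],
) -> None:
    try:
        apply_change()
    except TimeoutError as exc:
        # Only the expected fault type is handled, and it is re-raised
        # after being recorded. Every other exception propagates untouched.
        log.append(f"timeout during apply: {exc}")
        raise
    if not post_validate():
        raise PostValidationError("post-validation failed")
```

There is no bare `except` here, so an unexpected fault cannot be silently converted into a successful run.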