Automation Failure Stories

Automation Failure Stories: How PRIME Would Have Prevented Disaster

This post is part of our ongoing series on network automation best practices, grounded in the PRIME Framework and PRIME Philosophy.

Transparency Note

Examples, scenarios, and any outcome figures in this article are provided for education and are based on enterprise delivery experience or anonymised composite scenarios unless explicitly identified as direct Nautomation Prime client outcomes.

Why This Blog Exists

Everyone loves a good war story, especially when there's a lesson to be learned. Below are three real-world automation failures, what went wrong in each, and how the PRIME Framework would have prevented disaster.

🚦 PRIME Philosophy: Learning from Failure

Transparency: Document what happened and why, share postmortems
Measurability: Track outcomes, failures, and recovery times
Ownership: Take responsibility for automation and its impact
Safety: Build in checks, validation, and rollback for every workflow
Empowerment: Share lessons so others don’t repeat mistakes, foster a blameless culture

Failure Story #1: The Unvalidated VLAN Push

Scenario: A script pushed VLAN changes to 500 switches in parallel—without validation. Half the network lost connectivity, causing a major outage.

Technical Breakdown:

# What went wrong: No validation, no error handling, no rollback
for device in devices:
  netmiko.ConnectHandler(...).send_config_set(["vlan 123"])  # blindly pushes config

Root Causes:

No pre-flight validation (didn't check if VLANs already existed or if devices were reachable)
No error handling or rollback (failures left devices in inconsistent states)
No change tracking (couldn't prove what changed or when)

How PRIME Would Have Helped:

Pre-flight checks (Pinpoint, Re-engineer)
Transactional changes with rollback (Implement)
Change tracking and reporting (Measure)
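
Change tracking does not need heavy tooling to get started. The sketch below shows one possible shape for a change record, stored as JSON lines. The `record_change` helper, the field names, and the device names are all assumptions made for illustration; they are not part of the PRIME Framework or of the original script.

```python
import json
from datetime import datetime, timezone


def record_change(log_lines, device, action, status, operator="automation"):
    """Append one JSON-encoded change record to log_lines and return it.

    log_lines is any list-like sink; a real script would more likely write
    to a file or a logging backend. All field names are illustrative.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "device": device,
        "action": action,
        "status": status,
        "operator": operator,
    }
    log_lines.append(json.dumps(entry, sort_keys=True))
    return entry


# Hypothetical usage: two devices, one of which had to be rolled back
audit_log = []
record_change(audit_log, "sw-access-01", "add vlan 123", "success")
record_change(audit_log, "sw-access-02", "add vlan 123", "rolled_back")
```

With one record per device and per action, the question "what changed, and when?" becomes a search over the log rather than guesswork.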

PRIME-Aligned Solution:

# PRIME: Validate, change, rollback, and log
for device in devices:
  if not validate_vlan_absent(device, vlan_id):
    log(f"{device}: VLAN {vlan_id} already exists, skipping")
    continue
  backup = None  # stays None if taking the backup itself fails
  try:
    backup = get_running_config(device)
    push_vlan(device, vlan_id)
    log_change(device, vlan_id)
  except Exception as e:
    if backup is not None:
      restore_config(device, backup)  # roll back only when a backup exists
      alert(f"Rollback on {device} due to: {e}")
    else:
      alert(f"No change made on {device}, backup failed: {e}")
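
The solution above calls `validate_vlan_absent` without defining it. As a minimal sketch of what the parsing half of such a check could look like, the code below extracts VLAN IDs from `show vlan brief` style text. The function names and the sample output are assumptions for illustration; the column layout varies by platform, so treat the parsing rule as a starting point only.

```python
def vlan_ids_from_show_vlan_brief(output):
    """Return the set of VLAN IDs found in 'show vlan brief' style output.

    Assumes each VLAN row starts with the numeric VLAN ID in the first
    column, as in the sample below; other platforms may differ.
    """
    vlan_ids = set()
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            vlan_ids.add(int(fields[0]))
    return vlan_ids


def vlan_absent(output, vlan_id):
    """Return True when vlan_id does not appear in the parsed output."""
    return vlan_id not in vlan_ids_from_show_vlan_brief(output)


# Hand-written sample text, not captured from a real device
SAMPLE_OUTPUT = """\
VLAN Name                             Status    Ports
---- -------------------------------- --------- ----------------
1    default                          active    Gi0/1, Gi0/2
123  USERS                            active
1002 fddi-default                     act/unsup
"""

print(vlan_absent(SAMPLE_OUTPUT, 123))  # False: VLAN 123 is already present
print(vlan_absent(SAMPLE_OUTPUT, 200))  # True: VLAN 200 is not configured
```

Keeping the parser separate from the device connection means it can be tested against saved command output without touching the network.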

Best Practices:

Always validate device state before making changes
Use dry-run or pre-checks to catch issues early
Build rollback and error handling into every script
Log every change for audit and troubleshooting
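
One way to make the dry-run practice concrete is to separate planning from pushing and to default to dry-run. The sketch below assumes the existing VLANs per device have already been collected into a dictionary; `plan_vlan_change`, `apply_vlan_change`, and the inventory data are hypothetical names and values chosen for this example.

```python
def plan_vlan_change(existing_vlans_by_device, vlan_id):
    """Split devices into those that need the VLAN and those to skip."""
    to_change, to_skip = [], []
    for device, vlans in sorted(existing_vlans_by_device.items()):
        if vlan_id in vlans:
            to_skip.append(device)
        else:
            to_change.append(device)
    return to_change, to_skip


def apply_vlan_change(existing_vlans_by_device, vlan_id, push, dry_run=True):
    """Report the plan; call push(device, vlan_id) only when dry_run is False."""
    to_change, to_skip = plan_vlan_change(existing_vlans_by_device, vlan_id)
    for device in to_change:
        if dry_run:
            print(f"[dry-run] would add VLAN {vlan_id} to {device}")
        else:
            push(device, vlan_id)
    return to_change, to_skip


# Hypothetical inventory: device name -> VLAN IDs already configured
inventory = {"sw1": {1, 10}, "sw2": {1, 123}, "sw3": {1}}
pushed = []
changed, skipped = apply_vlan_change(
    inventory, 123, push=lambda device, vlan: pushed.append((device, vlan))
)
```

Because `dry_run` defaults to `True`, a push to production requires an explicit `dry_run=False`, which makes the risky path the deliberate one.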

Failure Story #2: The Credential Leak

Scenario: A consultant hardcoded device passwords in a public Git repo. The credentials were scraped and used for unauthorized access, resulting in a security incident.

Technical Breakdown:

# What went wrong: Hardcoded secrets
DEVICE_PASSWORD = "SuperSecret123"  # committed to git!

Root Causes:

Hardcoded secrets in code
No credential management or rotation
No audit trail for secret access
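
Hardcoded secrets can often be caught before they reach a repository. The sketch below is a deliberately simple scanner that flags quoted literals assigned to suspicious variable names. The pattern and the `find_hardcoded_secrets` helper are assumptions for illustration; dedicated secret-scanning tools use far richer rule sets and should be preferred in practice.

```python
import re

# Illustrative pattern only: matches assignments such as
# PASSWORD = "..." or api_key = '...' where the value is a quoted literal.
SECRET_ASSIGNMENT = re.compile(
    r"\b\w*(?:password|passwd|secret|token|api_key)\w*\s*=\s*([\"'])[^\"']+\1",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source):
    """Return (line_number, line) pairs that look like hardcoded secrets."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        if SECRET_ASSIGNMENT.search(line):
            findings.append((number, line.strip()))
    return findings


sample = '''import os
DEVICE_PASSWORD = "SuperSecret123"
username = "admin"
password = os.environ["DEVICE_PASSWORD"]
'''

for number, line in find_hardcoded_secrets(sample):
    print(f"line {number}: {line}")
```

In the sample, only the literal assignment on line 2 is flagged; the lookup through `os.environ` on line 4 is not, because no quoted value follows the equals sign directly.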

How PRIME Would Have Helped:

Secure credential storage (Safety)
Audit logging and access control (Measurability)
Team training and onboarding (Empowerment)

PRIME-Aligned Solution:

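# PRIME: Use a secrets manager and environment variables
import os
import hvac  # HashiCorp Vault client

# Vault address and token come from the environment, never from the code.
# Token auth is shown for brevity; other Vault auth methods exist.
client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])
# KV v2 engine: the secret payload sits under data -> data
creds = client.secrets.kv.v2.read_secret_version(path="network/devices")
password = creds["data"]["data"]["password"]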

Best Practices:

Never hardcode credentials—use environment variables or vaults
Rotate secrets regularly and audit access
Train all contributors on secure coding practices
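
Where a vault is not yet available, environment variables are the simpler of the two options named above. The sketch below reads credentials from the environment and fails fast when one is missing, rather than falling back to a default. The variable names `NET_USERNAME` and `NET_PASSWORD`, the helper, and the exception class are hypothetical, chosen only for this example.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential variable is not set."""


def load_device_credentials(env=None):
    """Read device credentials from environment variables, failing fast.

    NET_USERNAME and NET_PASSWORD are hypothetical variable names.
    Passing env explicitly makes the function easy to test.
    """
    env = os.environ if env is None else env
    required = ("NET_USERNAME", "NET_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingCredentialError(
            f"missing environment variables: {', '.join(missing)}"
        )
    return {"username": env["NET_USERNAME"], "password": env["NET_PASSWORD"]}


# Hypothetical usage with an explicit mapping instead of the real environment
creds = load_device_credentials(
    {"NET_USERNAME": "automation", "NET_PASSWORD": "injected-at-runtime"}
)
```

Failing loudly on a missing variable avoids the temptation to add a "temporary" default password to the script.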

Failure Story #3: The Untouchable Script

Scenario: A contractor wrote a critical automation script and left it undocumented and hard to maintain. When requirements changed, nobody could update it, and the script turned into technical debt and a business risk.

Technical Breakdown:

# What went wrong: No docs, no standards, no ownership
def do_everything():
  # ... 500 lines of magic ...
  pass

Root Causes:

No documentation or code comments
No knowledge transfer or onboarding
Vendor lock-in and lack of ownership

How PRIME Would Have Helped:

Inline documentation and code reviews (Transparency)
Knowledge transfer and runbooks (Empowerment)
Vendor-neutral, open-source design (Ownership)

PRIME-Aligned Solution:

# PRIME: Document, modularize, and share
def backup_device(device):
  """Backup device config and return path to backup file."""
  # ...

def push_config(device, config):
  """Push validated config to device."""
  # ...

# See docs/automation/backup.md for full workflow

Best Practices:

Document every script and workflow
Share knowledge through runbooks, wikis, and workshops
Avoid vendor lock-in by using open standards and tools

PRIME in Action: Turning Failure into Success

Document every failure and fix (blameless postmortems)
Build validation, rollback, and audit into every workflow
Share lessons learned with the team and community
Use failures as opportunities to improve processes and culture

Advanced Recovery Playbook

Immediate Triage: Stop automation, assess blast radius, and communicate transparently.
Root Cause Analysis: Use logs, version control, and device state to reconstruct what happened.
Rollback: Restore device state using backups or transaction logs.
Postmortem: Document the incident, fixes, and new safeguards. Share with the team.
Continuous Improvement: Update runbooks, scripts, and training to prevent recurrence.
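
For the root cause analysis and rollback steps, comparing a pre-change backup with the current device state is often the quickest way to see what an automation run actually did. The sketch below uses Python's standard `difflib` module; the `config_diff` helper and the configuration snippets are assumptions for illustration.

```python
import difflib


def config_diff(before, after, device="device"):
    """Return a unified diff between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"{device}/backup",
            tofile=f"{device}/current",
            lineterm="",
        )
    )


# Hand-written snapshots, not captured from a real device
backup = "hostname sw1\nvlan 10\n name USERS\n"
current = "hostname sw1\nvlan 10\n name USERS\nvlan 123\n"

for line in config_diff(backup, current, device="sw1"):
    print(line)
```

An empty diff after a rollback is a simple, checkable signal that the device is back to its pre-change configuration.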

Summary: Blog Takeaways

Every failure is a learning opportunity
PRIME principles prevent repeat mistakes and build resilient automation
Build transparency, safety, and ownership into every automation
Use technical controls (validation, rollback, secrets management) and cultural practices (blameless postmortems, documentation) together
Always test automation in a lab before production

Migrating Legacy Network Automation — Learn how to refactor and modernize old scripts to avoid common failure modes.
Deep Dive: CDP Network Audit — See how robust error handling and validation prevent outages.
Deep Dive: Access Switch Audit — Explore production-grade safety checks and rollback patterns.
Secure Credential Vaulting (Expert Tutorial) — Prevent leaks and enforce best practices.
AsyncIO in Network Automation (Expert Tutorial) — Avoid concurrency pitfalls and race conditions.
