Automation Failure Stories: How PRIME Would Have Prevented Disaster


This post is part of our ongoing series on network automation best practices, grounded in the PRIME Framework and PRIME Philosophy.

Transparency Note

Examples, scenarios, and any outcome figures in this article are provided for education and are based on enterprise delivery experience or anonymised composite scenarios unless explicitly identified as direct Nautomation Prime client outcomes.

Why This Blog Exists

Everyone loves a good war story, especially when it carries a lesson. Below are three automation failure stories: what went wrong, why, and how the PRIME Framework would have prevented each one.


🚦 PRIME Philosophy: Learning from Failure

  • Transparency: Document what happened and why, and share postmortems
  • Measurability: Track outcomes, failures, and recovery times
  • Ownership: Take responsibility for automation and its impact
  • Safety: Build in checks, validation, and rollback for every workflow
  • Empowerment: Share lessons so others don’t repeat mistakes, and foster a blameless culture

Failure Story #1: The Unvalidated VLAN Push

Scenario: A script pushed VLAN changes to 500 switches in parallel—without validation. Half the network lost connectivity, causing a major outage.

Technical Breakdown:

# What went wrong: No validation, no error handling, no rollback
for device in devices:
  netmiko.ConnectHandler(...).send_config_set(["vlan 123"])  # blindly pushes config

Root Causes:

  • No pre-flight validation (didn't check if VLANs already existed or if devices were reachable)
  • No error handling or rollback (failures left devices in inconsistent states)
  • No change tracking (couldn't prove what changed or when)

How PRIME Would Have Helped:

  • Pre-flight checks (Pinpoint, Re-engineer)
  • Transactional changes with rollback (Implement)
  • Change tracking and reporting (Measure)

PRIME-Aligned Solution:

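# PRIME: Validate, change, rollback, and log
# The helpers (validate_vlan_absent, push_vlan, ...) stand in for your own implementations
for device in devices:
  backup = None  # stays None until a backup has actually been taken
  try:
    if not validate_vlan_absent(device, vlan_id):
      log(f"{device}: VLAN {vlan_id} already exists, skipping")
      continue
    backup = get_running_config(device)
    push_vlan(device, vlan_id)
    log_change(device, vlan_id)
  except Exception as e:
    # Only roll back if there is a backup to restore from
    if backup is not None:
      restore_config(device, backup)
      alert(f"Rollback on {device} due to: {e}")
    else:
      alert(f"{device}: failed before any change was made: {e}")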

Best Practices:

  • Always validate device state before making changes
  • Use dry-run or pre-checks to catch issues early
  • Build rollback and error handling into every script
  • Log every change for audit and troubleshooting
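
The scenario above failed partly because all 500 switches were changed at once. One way to combine the pre-check, dry-run, and error-handling practices is a batched rollout that halts as soon as a batch reports a failure. The sketch below is illustrative only: `rollout_in_batches`, `precheck`, and `apply_change` are hypothetical names, not part of PRIME or of any library, and the callables stand in for your own device logic.

```python
from typing import Callable, Iterable, List, Tuple


def rollout_in_batches(
    devices: Iterable[str],
    precheck: Callable[[str], bool],
    apply_change: Callable[[str], None],
    batch_size: int = 10,
    dry_run: bool = True,
) -> Tuple[List[str], List[str], List[str]]:
    """Apply a change batch by batch; stop once a batch contains a failure.

    Returns (changed, skipped, failed). In dry-run mode nothing is applied:
    devices that pass the pre-check are reported as the ones that would change.
    """
    changed: List[str] = []
    skipped: List[str] = []
    failed: List[str] = []
    queue = list(devices)

    for start in range(0, len(queue), batch_size):
        for device in queue[start:start + batch_size]:
            if not precheck(device):
                skipped.append(device)  # pre-flight check failed: do not touch
                continue
            if dry_run:
                changed.append(device)  # would change, but nothing is pushed
                continue
            try:
                apply_change(device)
                changed.append(device)
            except Exception:
                failed.append(device)
        if failed:
            break  # halt the rollout; later batches are never attempted

    return changed, skipped, failed
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: a real push has to be requested explicitly, and a failure in an early batch limits the blast radius to that batch instead of the whole estate.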

Failure Story #2: The Credential Leak

Scenario: A consultant hardcoded device passwords in a public Git repo. The credentials were scraped and used for unauthorized access, resulting in a security incident.

Technical Breakdown:

# What went wrong: Hardcoded secrets
DEVICE_PASSWORD = "SuperSecret123"  # committed to git!

Root Causes:

  • Hardcoded secrets in code
  • No credential management or rotation
  • No audit trail for secret access

How PRIME Would Have Helped:

  • Secure credential storage (Safety)
  • Audit logging and access control (Measurability)
  • Team training and onboarding (Empowerment)

PRIME-Aligned Solution:

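# PRIME: Use a secrets manager and environment variables
import os
import hvac  # HashiCorp Vault client

# Vault address and token come from the environment, never from source code
client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])
if not client.is_authenticated():
  raise RuntimeError("Vault authentication failed")
# KV v2 nests the secret payload under data.data
creds = client.secrets.kv.v2.read_secret_version(path="network/devices")
password = creds["data"]["data"]["password"]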

Best Practices:

  • Never hardcode credentials—use environment variables or vaults
  • Rotate secrets regularly and audit access
  • Train all contributors on secure coding practices
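
For teams without a vault yet, the environment-variable option can be sketched as a small loader that fails fast when the credential is missing, instead of falling back to a default baked into the code. This is a minimal illustration under stated assumptions: the `get_device_password` helper, the `MissingCredentialError` class, and the use of `DEVICE_PASSWORD` as an environment variable name are all hypothetical.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is not present in the environment."""


def get_device_password(env=None, var_name="DEVICE_PASSWORD"):
    """Read the device password from the environment, or fail loudly.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    value = env.get(var_name, "")
    if not value:
        # The message names the variable but never echoes a secret value
        raise MissingCredentialError(f"{var_name} is not set; refusing to continue")
    return value
```

Because there is no hardcoded fallback, a misconfigured run stops before it connects to any device, and nothing secret ever needs to be committed to the repository.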

Failure Story #3: The Untouchable Script

Scenario: A contractor wrote a critical automation script and left it undocumented and unmaintainable. When requirements changed, nobody could update it, and the script became technical debt and a business risk.

Technical Breakdown:

# What went wrong: No docs, no standards, no ownership
def do_everything():
  # ... 500 lines of magic ...
  pass

Root Causes:

  • No documentation or code comments
  • No knowledge transfer or onboarding
  • Vendor lock-in and lack of ownership

How PRIME Would Have Helped:

  • Inline documentation and code reviews (Transparency)
  • Knowledge transfer and runbooks (Empowerment)
  • Vendor-neutral, open-source design (Ownership)

PRIME-Aligned Solution:

# PRIME: Document, modularize, and share
def backup_device(device):
  """Backup device config and return path to backup file."""
  # ...

def push_config(device, config):
  """Push validated config to device."""
  # ...

# See docs/automation/backup.md for full workflow

Best Practices:

  • Document every script and workflow
  • Share knowledge through runbooks, wikis, and workshops
  • Avoid vendor lock-in by using open standards and tools
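
"Document every script" is easier to enforce when a machine checks it. As one possible approach, the sketch below uses Python's standard `ast` module to list functions that have no docstring, which could run in code review or CI. The function name `functions_missing_docstrings` is hypothetical, and the check only proves that a docstring exists, not that it is useful.

```python
import ast


def functions_missing_docstrings(source: str) -> list:
    """Return the sorted names of functions in `source` that lack a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        # Covers plain and async functions, including methods and nested functions
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return sorted(missing)
```

Run against the "500 lines of magic" example above, a check like this would flag `do_everything` immediately, long before the contractor has left.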

PRIME in Action: Turning Failure into Success

  • Document every failure and fix (blameless postmortems)
  • Build validation, rollback, and audit into every workflow
  • Share lessons learned with the team and community
  • Use failures as opportunities to improve processes and culture

Advanced Recovery Playbook

  • Immediate Triage: Stop automation, assess blast radius, and communicate transparently.
  • Root Cause Analysis: Use logs, version control, and device state to reconstruct what happened.
  • Rollback: Restore device state using backups or transaction logs.
  • Postmortem: Document the incident, fixes, and new safeguards. Share with the team.
  • Continuous Improvement: Update runbooks, scripts, and training to prevent recurrence.
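
The rollback step depends on knowing exactly what was changed and in what order. One way to get that is a change journal that stores an undo action alongside every applied change and replays them in reverse. The sketch below is a simplified in-memory illustration; `ChangeJournal` and its methods are hypothetical names, and a production version would persist the journal so it survives a crashed run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChangeJournal:
    """Records an undo action for every applied change."""

    _undo_stack: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)

    def record(self, description: str, undo: Callable[[], None]) -> None:
        """Register a change that has just been applied, with how to undo it."""
        self._undo_stack.append((description, undo))

    def rollback(self) -> List[str]:
        """Undo recorded changes, most recent first; return what was undone."""
        undone = []
        while self._undo_stack:
            description, undo = self._undo_stack.pop()
            undo()
            undone.append(description)
        return undone
```

Undoing in reverse order matters when changes depend on each other, and the returned list of descriptions doubles as raw material for the postmortem.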

Summary: Blog Takeaways

  • Every failure is a learning opportunity
  • PRIME principles prevent repeat mistakes and build resilient automation
  • Build transparency, safety, and ownership into every automation
  • Use technical controls (validation, rollback, secrets management) and cultural practices (blameless postmortems, documentation) together
  • Always test automation in a lab before production

