Blueprint for Enterprise-Ready Network Automation Pipelines

This post is part of our ongoing series on network automation best practices, grounded in the PRIME Framework and PRIME Philosophy.

Transparency Note

Examples, scenarios, and any outcome figures in this article are provided for education and are based on enterprise delivery experience or anonymised composite scenarios unless explicitly identified as direct Nautomation Prime client outcomes.

Why This Blog Exists

Enterprise automation is more than scripts: it is pipelines, version control, and safe rollouts. This post covers how to build CI/CD, GitOps, and containerized pipelines for network automation, and how the PRIME Framework keeps those pipelines safe while empowering the team that runs them.

🚦 PRIME Philosophy: Safety and Empowerment

Safety: Automated testing, validation, and rollback
Empowerment: Enable self-service and rapid iteration
Transparency: Document every change and deployment
Measurability: Track outcomes and failures
Ownership: Your team controls the pipeline

Why Automation Pipelines Matter

Safety: Automated testing and validation gates catch bugs before production
Repeatability: The same process runs every time, which reduces human error
Audit trail: Every deployment is tracked and reversible
Speed: Automated testing and deployment are much faster than manual processes
Scalability: Handles thousands of devices consistently
Compliance: Demonstrates robust change control and audit procedures

Pipeline Architecture: Overview

A typical enterprise network automation pipeline has these stages:

Code Push → Lint & Format → Unit Tests → Integration Tests → 
Staging Deployment → Smoke Tests → Approval → Production Deployment → 
Monitoring & Alerts → Rollback (if needed)
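
The stage order above can be sketched as a small fail-fast runner. This is only an illustration of the control flow, not how GitHub Actions schedules jobs; the `Stage` type, the `changes_network` flag, and the stage callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]        # returns True on success
    changes_network: bool = False  # True for deployment stages


def run_pipeline(stages: List[Stage],
                 rollback: Callable[[], None]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Rollback is only called if a stage that changes the network
    has already started, so a lint failure never triggers it.
    """
    completed: List[str] = []
    touched_network = False
    for stage in stages:
        touched_network = touched_network or stage.changes_network
        if not stage.run():
            if touched_network:
                rollback()
            return False, completed
        completed.append(stage.name)
    return True, completed
```

The design choice worth noting is that rollback is tied to whether a deployment stage has run, which mirrors the "Rollback (if needed)" step at the end of the diagram.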

Example: Building a CI/CD Pipeline (GitHub Actions)

Step 1: Lint and Format Check

# .github/workflows/lint.yml
name: Lint and Format

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install flake8 black pylint

      - name: Run black formatter
        run: black --check src/ tests/

      - name: Run flake8 linter
        run: flake8 src/ tests/ --max-line-length=100

      - name: Run pylint
        run: pylint src/

Step 2: Unit Testing

# .github/workflows/unit-tests.yml
name: Unit Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Quote every version so YAML never reads 3.10 as the float 3.1
        python-version: ['3.8', '3.9', '3.10']
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt

      - name: Run pytest
        run: |
          # Emit coverage.xml so the upload step below has a file to send
          pytest tests/ -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v2
        with:
          file: ./coverage.xml
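
The workflow above runs whatever lives under `tests/`. As a rough sketch of the kind of pure function and test it might exercise, the example below renders VLAN lines and checks them. `render_vlan_config` and the IOS-style output are hypothetical and chosen only for illustration; in a real suite `pytest.raises` would be the idiomatic way to check the error case, and plain `try`/`except` is used here only to keep the sketch free of third-party imports.

```python
from typing import Dict, List


def render_vlan_config(vlans: Dict[int, str]) -> List[str]:
    """Render VLAN definitions as IOS-style config lines, sorted by VLAN ID."""
    lines: List[str] = []
    for vlan_id in sorted(vlans):
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN ID out of range: {vlan_id}")
        lines.append(f"vlan {vlan_id}")
        lines.append(f" name {vlans[vlan_id]}")
    return lines


def test_render_vlan_config_sorts_by_id():
    assert render_vlan_config({20: "USERS", 10: "MGMT"}) == [
        "vlan 10", " name MGMT", "vlan 20", " name USERS",
    ]


def test_render_vlan_config_rejects_invalid_id():
    try:
        render_vlan_config({5000: "BAD"})
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for VLAN 5000")
```

Tests like these need no devices at all, which is what keeps the unit stage fast enough to run on every push.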

Step 3: Integration Testing (Mock Devices)

# .github/workflows/integration-tests.yml
name: Integration Tests

on: [push, pull_request]

jobs:
  integration:
    runs-on: ubuntu-latest
    services:
      mock-devices:
        image: mock-network-devices:latest
        ports:
          - 22:22
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

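      # Check the mock device port first so a dead container fails fast
      - name: Test device reachability
        run: |
          python -c "import socket; socket.create_connection(('localhost', 22), timeout=5).close(); print('Mock device SSH port reachable')"

      # --testbed is a custom option; it must be registered in conftest.py
      - name: Run integration tests
        run: pytest tests/integration/ -v --testbed=mock_testbed.yaml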
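
The `--testbed` flag used here (and the `--environment` flag used by the smoke tests later) is not built into pytest, so the project has to register it. A minimal sketch of a `conftest.py` that does so follows; the default values and the `get_run_options` helper are assumptions made for illustration.

```python
# conftest.py (sketch)


def pytest_addoption(parser):
    """Register the custom command-line flags used by the workflows."""
    parser.addoption(
        "--testbed",
        action="store",
        default="mock_testbed.yaml",  # assumed default
        help="Path to the testbed file describing target devices",
    )
    parser.addoption(
        "--environment",
        action="store",
        default="staging",  # assumed default
        help="Target environment name, for example staging",
    )


def get_run_options(config):
    """Collect the custom options from a pytest config object."""
    return {
        "testbed": config.getoption("--testbed"),
        "environment": config.getoption("--environment"),
    }
```

Inside a test or fixture, the same values are available through `request.config.getoption("--testbed")`.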

Step 4: Staging Deployment

# .github/workflows/deploy-staging.yml
name: Deploy to Staging

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Deploy to staging
        env:
          VAULT_ADDR: ${{ secrets.VAULT_ADDR }}
          VAULT_TOKEN: ${{ secrets.STAGING_VAULT_TOKEN }}
        run: |
          python scripts/deploy.py --environment staging --use-vault

      - name: Run smoke tests
        run: |
          pytest tests/smoke/ -v --environment staging

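      - name: Notify Slack
        uses: slackapi/slack-github-action@v1.24.0
        env:
          # Secret name is an assumption; the action needs a webhook URL to post
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
        with:
          payload: |
            {
              "text": "Staging deployment successful"
            }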
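
The workflow calls `scripts/deploy.py --environment staging --use-vault` but the script itself is not shown. The sketch below covers only its command-line surface and the check that the Vault variables from the workflow's `env:` block are present. The exit code `2` and the function names are assumptions, and the actual deployment logic is deliberately left out.

```python
import argparse
import os
from typing import List, Mapping, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags the workflows pass to the deploy script."""
    parser = argparse.ArgumentParser(description="Deploy network configuration")
    parser.add_argument("--environment", required=True)
    parser.add_argument("--use-vault", action="store_true")
    return parser.parse_args(argv)


def missing_vault_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of required Vault variables that are unset or empty."""
    return [name for name in ("VAULT_ADDR", "VAULT_TOKEN") if not env.get(name)]


def main(argv: Optional[List[str]] = None,
         env: Optional[Mapping[str, str]] = None) -> int:
    args = parse_args(argv)
    env = os.environ if env is None else env
    if args.use_vault:
        missing = missing_vault_settings(env)
        if missing:
            # A non-zero exit code fails the workflow step
            print(f"Missing Vault settings: {', '.join(missing)}")
            return 2
    print(f"Deploying to {args.environment}")  # real deployment logic goes here
    return 0
```

A script entry point would call `sys.exit(main())` so the return value becomes the process exit code that GitHub Actions inspects.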

Step 5: Production Deployment (Blue-Green)

# .github/workflows/deploy-prod.yml
name: Deploy to Production

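on:
  workflow_dispatch:  # Manual trigger; approvals come from the "production" environment's protection rules

jobs:
  deploy-prod:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: read
      issues: write  # needed by the incident-ticket step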
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Blue-Green Deployment
        env:
          VAULT_ADDR: ${{ secrets.VAULT_ADDR }}
          VAULT_TOKEN: ${{ secrets.PROD_VAULT_TOKEN }}
        run: |
          # Deploy to the idle green environment
          python scripts/deploy.py --environment production-green

          # Run health checks
          python scripts/health-check.py --environment production-green

          # Switch traffic from blue (current) to green (new)
          python scripts/switch-traffic.py --from blue --to green

          # Run validation
          pytest tests/smoke/ -v --environment production-green

      - name: Rollback on failure
        if: failure()
        run: |
          python scripts/switch-traffic.py --from green --to blue
          echo "Rollback completed"

      - name: Create incident ticket
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            // Await the call so the step waits for the issue to be created
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Production deployment failed - rollback completed',
              body: 'Check deployment logs for details'
            })
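
The rollback and incident steps in this workflow only run when an earlier step exits with a non-zero status, so the health-check script must translate its findings into an exit code. A minimal sketch of that translation follows; the check names and the `run_health_checks` function are hypothetical, and real checks would query devices or monitoring rather than return constants.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every check and return a process exit code.

    Returns 0 when all checks pass and 1 otherwise. A check that
    raises is treated as a failure rather than crashing the run,
    so every check still gets reported.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception as exc:  # report and continue with the next check
            print(f"[ERROR] {name}: {exc}")
            ok = False
        if ok:
            print(f"[PASS] {name}")
        else:
            failed.append(name)
    if failed:
        print(f"Failed checks: {', '.join(failed)}")
        return 1
    return 0
```

A script would end with `sys.exit(run_health_checks(...))`, which is what lets `if: failure()` fire in the later steps.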

Advanced Patterns: Approval Gates & ITSM Integration

ServiceNow Integration for Change Approvals

# scripts/change_approval.py
import requests
import os
import time

class ServiceNowChangeManager:
    def __init__(self):
        self.snow_url = os.environ['SERVICENOW_URL']
        self.snow_user = os.environ['SERVICENOW_USER']
        self.snow_pass = os.environ['SERVICENOW_PASS']

    def create_change_ticket(self, description, automation_type):
        """Create a change ticket in ServiceNow."""
        headers = {'Content-Type': 'application/json'}
        auth = (self.snow_user, self.snow_pass)

        data = {
            'short_description': description,
            'description': f'Automated {automation_type} deployment',
            'category': 'Network',
            'subcategory': 'Automation',
            'cmdb_ci': 'Network Infrastructure',
        }

        response = requests.post(
            f"{self.snow_url}/api/now/table/change_request",
            headers=headers,
            auth=auth,
            json=data,
            timeout=30  # avoid hanging the pipeline on a stalled connection
        )

        if response.status_code == 201:
            change = response.json()['result']
            return {'change_id': change['number'], 'sys_id': change['sys_id']}
        else:
            raise RuntimeError(f"Failed to create change: {response.text}")

    def wait_for_approval(self, change_id, max_wait_minutes=30):
        """Wait for change to be approved."""
        headers = {'Accept': 'application/json'}
        auth = (self.snow_user, self.snow_pass)

        start_time = time.time()
        max_wait_seconds = max_wait_minutes * 60

        while time.time() - start_time < max_wait_seconds:
            response = requests.get(
                f"{self.snow_url}/api/now/table/change_request?number={change_id}",
                headers=headers,
                auth=auth,
                timeout=30
            )

            if response.status_code == 200:
                records = response.json()['result']
                if not records:
                    # An empty result means the change number does not exist
                    raise RuntimeError(f"Change {change_id} not found")
                change = records[0]
                approval_state = change.get('approval', 'pending')

                if approval_state == 'approved':
                    return True
                elif approval_state == 'rejected':
                    raise RuntimeError(f"Change rejected: {change.get('close_notes')}")

            print(f"Change {change_id} waiting for approval...")
            time.sleep(30)  # Check every 30 seconds

        raise TimeoutError(f"Change approval timeout after {max_wait_minutes} minutes")

    def close_change(self, sys_id, status='successful'):
        """Close a change ticket, identified by its sys_id."""
        headers = {'Content-Type': 'application/json'}
        auth = (self.snow_user, self.snow_pass)

        # State and close-code values vary between ServiceNow instances;
        # check the choice lists on your change_request table.
        data = {
            'state': 'closed',
            'close_code': '1' if status == 'successful' else '2',
            'close_notes': f'Deployment completed: {status}'
        }

        # The Table API updates a single record by sys_id in the URL path
        response = requests.patch(
            f"{self.snow_url}/api/now/table/change_request/{sys_id}",
            headers=headers,
            auth=auth,
            json=data,
            timeout=30
        )

        return response.status_code == 200


# Usage in GitHub Actions
if __name__ == "__main__":
    change_mgr = ServiceNowChangeManager()
    change = change_mgr.create_change_ticket("Deploy VLAN config", "network-config")
    print(f"Change created: {change['change_id']}")

    # Wait for approval (will block pipeline)
    change_mgr.wait_for_approval(change['change_id'])
    print("Change approved, proceeding with deployment")

    # Deploy...

    # Close the change (updates need the sys_id, not the change number)
    change_mgr.close_change(change['sys_id'], status='successful')
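
The approval wait above is a polling loop with a deadline. The sketch below isolates that pattern so it can be tested without ServiceNow or real sleeping: the clock and sleep functions are injectable, and `poll_until` and its return values are hypothetical names used only for this illustration.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], str],
               timeout_s: float,
               interval_s: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll check() until it reports a final state or the deadline passes.

    check() returns a state string; 'approved' and 'rejected' are final.
    Returns the final state, or 'timeout' when another full interval
    would overrun the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        state = check()
        if state in ("approved", "rejected"):
            return state
        if clock() + interval_s > deadline:
            return "timeout"
        sleep(interval_s)
```

Using `time.monotonic` rather than `time.time` is a deliberate choice here: it is unaffected by wall-clock adjustments, so a clock change on the runner cannot shorten or extend the wait.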

Deployment Strategies: Blue-Green & Canary

Blue-Green Deployment

Blue Environment (Old)     Green Environment (New)
    ✓ Active                    ✗ Inactive

    → Traffic                  

After validation:

Blue Environment (Old)     Green Environment (New)
    ✗ Inactive                  ✓ Active

                   → Traffic

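# scripts/blue_green_deploy.py
import asyncio
import time

from nornir import InitNornir

# apply_config, validate_config, switch_load_balancer_to_green,
# switch_load_balancer_to_blue and check_health are project-specific
# helpers assumed to be defined elsewhere; they are not part of Nornir.


async def deploy_blue_green(config_changes):
    """Deploy with blue-green strategy."""
    # Deploy to green environment (idle)
    print("Deploying to GREEN environment...")
    green_nr = InitNornir(config_file="config_green.yaml")
    deploy = green_nr.run(task=apply_config, config=config_changes)
    if deploy.failed:
        raise RuntimeError(f"Deploy failed on: {list(deploy.failed_hosts)}")

    # Validate green before it receives any traffic
    print("Validating GREEN environment...")
    validation = green_nr.run(task=validate_config)
    if validation.failed:
        raise RuntimeError(f"Validation failed on: {list(validation.failed_hosts)}")

    # Switch traffic
    print("Switching traffic BLUE → GREEN...")
    switch_load_balancer_to_green()

    # Monitor green; switch back to blue if health degrades
    try:
        await monitor_health("green", duration=300)  # 5 minutes
    except RuntimeError:
        print("Switching traffic back GREEN → BLUE...")
        switch_load_balancer_to_blue()
        raise

    # If all good, old blue can be decommissioned
    print("Deployment successful!")


async def monitor_health(environment, duration=300):
    """Monitor health metrics for a period."""
    start = time.time()
    while time.time() - start < duration:
        health = check_health(environment)
        if health['error_rate'] > 0.05:
            raise RuntimeError("High error rate detected, rolling back")
        await asyncio.sleep(10)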
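
One caveat with the script above: Nornir's `run` is a blocking call, so inside an `async def` it holds up the event loop until it returns. If several environments or device groups need to be handled concurrently, one option is `asyncio.to_thread` (Python 3.9+). The sketch below uses a stand-in `blocking_deploy` function in place of a real Nornir run; the function and group names are hypothetical.

```python
import asyncio
import time
from typing import List


def blocking_deploy(group: str) -> str:
    """Stand-in for a blocking call such as a Nornir run."""
    time.sleep(0.05)  # simulate device I/O
    return f"{group}: deployed"


async def deploy_groups(groups: List[str]) -> List[str]:
    """Run one blocking deploy per group in worker threads.

    asyncio.gather returns results in the same order as the inputs.
    """
    tasks = [asyncio.to_thread(blocking_deploy, group) for group in groups]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(deploy_groups(["green-core", "green-edge"])))
```

The same `asyncio.run(...)` call is how a coroutine such as `deploy_blue_green` would be started from a script entry point.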

Canary Deployment

All traffic initially to Stable

Stable (99%)  →  New (1%)
    ↓              ↓
  990 devices     10 devices

Monitor metrics, then gradually shift traffic (Stable / New):
    50% / 50%  →  25% / 75%  →  0% / 100%
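
The progression above can be sketched as cumulative rollout waves over a device list. This is a simplified model under stated assumptions: `deploy` and `healthy` are caller-supplied callables, the percentages are the share of the fleet on the new version after each wave, and a real rollout would also pause between waves to collect metrics.

```python
from typing import Callable, List, Sequence


def canary_batches(devices: Sequence[str],
                   percentages: Sequence[int]) -> List[List[str]]:
    """Split devices into waves from cumulative target percentages."""
    batches: List[List[str]] = []
    done = 0
    for pct in percentages:
        # Round up so a small percentage still selects at least one device
        target = min(len(devices), (len(devices) * pct + 99) // 100)
        if target > done:
            batches.append(list(devices[done:target]))
            done = target
    return batches


def canary_rollout(devices: Sequence[str],
                   percentages: Sequence[int],
                   deploy: Callable[[str], None],
                   healthy: Callable[[List[str]], bool]) -> List[str]:
    """Deploy wave by wave and stop after the first unhealthy wave.

    Returns the devices that received the new version.
    """
    updated: List[str] = []
    for batch in canary_batches(devices, percentages):
        for device in batch:
            deploy(device)
        updated.extend(batch)
        if not healthy(updated):
            break
    return updated
```

With 1,000 devices and targets of 1, 50, 75, and 100 percent, the waves contain 10, 490, 250, and 250 devices, matching the shift shown in the diagram.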

Monitoring & Rollback

Automated Rollback on Failure

from nornir import InitNornir

# backup_config, apply_config, validate_config, restore_config and send_alert
# are project-specific helpers assumed to be defined elsewhere.


def deploy_with_rollback(config_changes):
    """Deploy config with automatic rollback."""
    nr = InitNornir(config_file="config.yaml")

    # Backup current configs
    print("Backing up configurations...")
    nr.run(task=backup_config)

    # Apply changes
    print("Applying configuration changes...")
    results = nr.run(task=apply_config, config=config_changes)

    # Validate changes (Nornir skips hosts that already failed)
    print("Validating changes...")
    validation = nr.run(task=validate_config)

    # Check for failures in either step
    failed = sorted(set(results.failed_hosts) | set(validation.failed_hosts))
    if failed:
        print(f"Deployment failed on: {failed}")

        # Auto-rollback; on_failed=True includes hosts Nornir marked as failed
        print("Rolling back configuration...")
        nr.run(task=restore_config, on_failed=True)  # Restore from backup

        # Alert ops
        send_alert(f"Deployment failed and rolled back: {failed}")
        return False

    return True
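
The backup, apply, validate, restore sequence can be exercised without any devices by modelling each device's configuration as a list of lines. The sketch below is a simulation only, with a hypothetical `simulate_deploy_with_rollback` function and a caller-supplied `validate` callable; like the function above, it restores every device if any one of them fails validation.

```python
from typing import Callable, Dict, List


def simulate_deploy_with_rollback(
    configs: Dict[str, List[str]],
    change: List[str],
    validate: Callable[[str, List[str]], bool],
) -> List[str]:
    """Apply change to every device; restore all devices if any fails.

    configs maps device name to config lines and is modified in place.
    Returns the names of devices that failed validation (empty on success).
    """
    # Backup: copy each list so later edits cannot alter the snapshot
    backups = {name: list(lines) for name, lines in configs.items()}

    # Apply the change to every device
    for name in configs:
        configs[name] = configs[name] + list(change)

    # Validate, then roll everything back on any failure
    failed = [name for name, lines in configs.items() if not validate(name, lines)]
    if failed:
        for name, lines in backups.items():
            configs[name] = lines
    return failed
```

Rolling back the whole fleet keeps every device on the same known-good configuration; restoring only the failed devices is the alternative when partial rollouts are acceptable.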

PRIME in Action: Safe Rollouts

Automate rollback on failure (detect errors, revert to last known good state)
Require approvals for production changes (integrate with ITSM, code owners)
Track every deployment, outcome, and incident (logs, dashboards, runbooks)
Use canary and blue-green deployments for zero-downtime rollouts
Integrate observability and incident response into every pipeline
Measure success: deployment frequency, lead time, failure rate, MTTR
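
The four measures in the last item can be computed from a simple log of deployment records. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; where those timestamps come from (CI logs, ITSM tickets) is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # when service was restored after a failure


def summarize(deployments: List[Deployment], period_days: int) -> dict:
    """Compute deployment frequency, lead time, failure rate, and MTTR."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": (sum(lead_times, timedelta()) / len(lead_times)
                           if lead_times else None),
        "failure_rate": len(failures) / len(deployments) if deployments else None,
        "mttr": (sum(recoveries, timedelta()) / len(recoveries)
                 if recoveries else None),
    }
```

Metrics with no data return `None` rather than zero, so a period with no failures is not reported as an MTTR of zero.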

Summary: Blog Takeaways

Pipelines make automation safe, scalable, and repeatable
Use CI/CD, GitOps, and containers for production-grade workflows
PRIME principles ensure safety, empowerment, and transparency
Integrate validation, observability, and rollback into every pipeline
Use advanced rollout patterns (blue-green, canary, staged) for safe production changes
Automate compliance, approvals, and incident response for enterprise readiness
