Why Nornir? Understanding the Problem and Solution
"From 30 Minutes to 3 Minutes: Why Enterprise Networks Need Parallel Automation"
You've completed the Beginner Tutorials and successfully built a multi-device config backup script. It works great for 10 devices, even 50 devices. But what if your organisation has 500 devices? Or 5,000?
In this tutorial, we'll uncover the critical scalability problem with your current approach, demonstrate how it manifests in real networks, and introduce Nornir, the solution designed for enterprise automation.
Important: This tutorial is conceptual. We're NOT writing production code yet. We're understanding the problem so that Nornir's solution makes sense.
What You'll Learn
By the end of this tutorial, you'll understand:
- Why loops are fundamentally limited for device operations
- The mathematical principle of parallelization (Amdahl's Law)
- Real-world performance impact: sequential vs. parallel
- Nornir's architecture and why it's designed differently
- The cost/benefit tradeoff of adding framework complexity
- When Nornir is the right choice (and when it isn't)
The Problem: Sequential Bottleneck
Let's revisit your Beginner Tutorial #3:
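A minimal stand-in with the same shape as that script, as a sketch rather than the tutorial's actual code. `fetch_running_config()` is a hypothetical helper that fakes the SSH connection and `show running-config` round-trip with `time.sleep()` (a real script would use an SSH library such as Netmiko), and the delays are scaled down so it finishes in a fraction of a second.

```python
import tempfile
import time
from pathlib import Path

# Hypothetical delays, scaled down from the 2 s connect + 3 s command figures used below
CONNECT_S = 0.02
FETCH_S = 0.03

def fetch_running_config(host: str) -> str:
    """Stand-in for SSH connect + `show running-config`; sleeping mimics network waits."""
    time.sleep(CONNECT_S)  # SSH handshake
    time.sleep(FETCH_S)    # command round-trip
    return f"hostname {host}\n!"

def backup_all(hosts: list[str], out_dir: Path) -> list[str]:
    """Back up devices strictly one after another."""
    done = []
    for host in hosts:
        config = fetch_running_config(host)           # blocks until this device answers
        (out_dir / f"{host}.cfg").write_text(config)  # save to file
        done.append(host)                             # only now does the next device start
    return done

if __name__ == "__main__":
    hosts = [f"sw{i:02d}" for i in range(1, 5)]
    with tempfile.TemporaryDirectory() as tmp:
        start = time.perf_counter()
        backup_all(hosts, Path(tmp))
        print(f"{len(hosts)} devices took {time.perf_counter() - start:.2f}s")
```

The total is simply devices × per-device time, because each device's waiting is added on top of the previous one's.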
What this does:
- Connect to Device #1
- Retrieve config (5 seconds of network I/O)
- Save to file (1 second)
- Disconnect
- THEN move to Device #2
- Repeat...
The fundamental issue: While the script waits for Device #1's network response, your CPU is completely idle. It can't fetch Device #2's config; it's stuck waiting.
Real-World Impact
Let's do some math:
Scenario: Enterprise network with 300 devices
Per-device timing:
- SSH connection: 2 seconds
- `show running-config` execution: 3 seconds (network latency)
- File write: 1 second
- Total per device: ~6 seconds
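Sequential approach (Tutorial #3): 300 devices × 6 seconds = 1,800 seconds = 30 minutes
Parallel approach (Nornir): with, for example, 10 devices in progress at a time, 300 ÷ 10 = 30 rounds × 6 seconds = 180 seconds = 3 minutes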
Real-world result: The same job takes 30 minutes with your current script but only 2-3 minutes with Nornir.
That's a 10-15x speedup.
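The arithmetic behind those numbers, as a sketch. It assumes every device takes exactly 6 seconds and that devices are processed in waves of `workers` at a time; the worker counts are illustrative assumptions.

```python
import math

def estimated_runtime_s(devices: int, per_device_s: float, workers: int = 1) -> float:
    """Ideal runtime when devices run in 'waves' of `workers` at a time.

    Assumes identical devices and no other bottleneck (no serial setup, no slow hosts).
    """
    waves = math.ceil(devices / workers)
    return waves * per_device_s

for workers in (1, 10, 20):
    minutes = estimated_runtime_s(300, 6, workers) / 60
    print(f"{workers:>2} worker(s): {minutes:.1f} min")
```

One worker reproduces the 30-minute sequential figure and ten workers the 3-minute one. Twenty workers (the default `num_workers` of Nornir's threaded runner) would give 1.5 minutes on paper; real runs come in slower because some parts of a job don't parallelize, which is what Amdahl's Law below quantifies.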
Visualizing Sequential vs. Parallel
Sequential Execution (Tutorial #3 Approach)
Notice: While Device 1 waits for the network, Devices 2, 3, 4 aren't even started. The CPU is idle.
Parallel Execution (Nornir Approach)
Notice: While Device 1 waits for the network, Devices 2, 3, and 4 are fetching at the same time. The waits overlap instead of adding up.
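Both timelines can be drawn as text with a few lines of Python. This sketch assumes four identical devices that each take 6 seconds, drawn at one character per second, with `#` marking time a device is connected and busy.

```python
def render_timeline(devices: int, workers: int, per_device_s: int = 6) -> str:
    """Text timeline, one character per second; '#' marks a device connected and busy."""
    rows = []
    for i in range(devices):
        start = (i // workers) * per_device_s  # devices start in waves of `workers`
        rows.append(f"Device {i + 1}: |" + " " * start + "#" * per_device_s)
    return "\n".join(rows)

print("Sequential (1 worker)")
print(render_timeline(4, workers=1))
print()
print("Parallel (4 workers)")
print(render_timeline(4, workers=4))
```

With one worker, Device 4 doesn't start until second 18 and the job ends at second 24; with four workers, all four bars line up and the job ends at second 6.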
Task Execution Flow Comparison
Sequential Task Flow
```mermaid
flowchart TD
    Start([Start Backup Job]) --> D1[Connect Device 1]
    D1 --> F1[Fetch Config]
    F1 --> S1[Save File]
    S1 --> D2[Connect Device 2]
    D2 --> F2[Fetch Config]
    F2 --> S2[Save File]
    S2 --> D3[Connect Device 3]
    D3 --> F3[Fetch Config]
    F3 --> S3[Save File]
    S3 --> D4[Connect Device 4]
    D4 --> F4[Fetch Config]
    F4 --> S4[Save File]
    S4 --> End([Job Complete])
    style D1 fill:#ffcccc
    style D2 fill:#ffcccc
    style D3 fill:#ffcccc
    style D4 fill:#ffcccc
    style F1 fill:#ccccff
    style F2 fill:#ccccff
    style F3 fill:#ccccff
    style F4 fill:#ccccff
```
Parallel Task Flow (Nornir)
```mermaid
flowchart TD
    Start([Start Backup Job]) --> Pool["Worker Pool Initialized<br/>(up to 10 workers)"]
    Pool --> D1["Device 1:<br/>Connect + Fetch"]
    Pool --> D2["Device 2:<br/>Connect + Fetch"]
    Pool --> D3["Device 3:<br/>Connect + Fetch"]
    Pool --> D4["Device 4:<br/>Connect + Fetch"]
    D1 --> Save[Results Aggregated]
    D2 --> Save
    D3 --> Save
    D4 --> Save
    Save --> End([Job Complete])
    style Pool fill:#ccffcc
    style D1 fill:#ffffcc
    style D2 fill:#ffffcc
    style D3 fill:#ffffcc
    style D4 fill:#ffffcc
    style Save fill:#ccffcc
```
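The Math: Amdahl's Law
Why doesn't this scale infinitely? There's a mathematical ceiling:
Amdahl's Law: if a fraction $P$ of the job can run in parallel and $N$ workers share that part, the best possible speedup is

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$

For network operations, assuming roughly 95% of the work is parallelizable ($P = 0.95$):
- 10 parallel connections: 6.9x speedup
- 20 parallel connections: 10.3x speedup
- 50 parallel connections: 14.5x speedup
- 100 parallel connections: 16.8x speedup (diminishing returns clearly visible here)
However many workers you add, the speedup can never exceed $1/(1 - P) = 20$x.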
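The same numbers, computed directly from the formula as a sketch; `P = 0.95` is the assumed parallel share used above.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: S(N) = 1 / ((1 - P) + P / N)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

P = 0.95  # assumed share of the job that can run concurrently
for n in (10, 20, 50, 100):
    print(f"{n:>3} workers: {amdahl_speedup(P, n):4.1f}x")
print(f"ceiling: {1 / (1 - P):.0f}x")
```

Doubling from 50 to 100 workers only moves the speedup from about 14.5x to 16.8x, which is why moderate worker counts are the practical sweet spot.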
Practical takeaway: You get massive gains up to ~10-20 concurrent connections, then diminishing returns. But even diminishing returns beat sequential by a mile.
Why Your Tutorial #3 Script Doesn't Scale
Your current multi-device-config-backup.py works through the device list in a plain `for` loop, one device per iteration.
This is sequential iteration. It's simple, it's clear, and it's great for learning, but it's a dead end at enterprise scale.
The Limitations
| Aspect | Tutorial #3 | Enterprise Need |
|---|---|---|
| Max devices | 50-100 (before slowness) | 500-5000+ |
| Expected runtime | 10+ minutes | 2-3 minutes |
| Code complexity | Simple loops | Framework (Nornir) |
| Failure isolation | Per-device try/except | Unified result aggregation |
| Extensibility | Hard (one-off changes) | Easy (reusable tasks) |
| Team reusability | One script per job | Shared task library |
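Interlude: Why Not Just Use Threading in Python?
You might think: "Why learn Nornir? Can't I just add threading to Tutorial #3?"
You could, but hand-rolling it is harder than it looks. (And if you want the full story on why threading is so risky for network automation, check out our deep-dive: Threading in Network Automation: When to Use It and When to Avoid It)
Problems:
- Python's GIL (Global Interpreter Lock): Only one thread runs Python bytecode at a time, but a thread blocked on network I/O releases the GIL, so threads do overlap SSH waits. The GIL isn't the real obstacle; the plumbing below is.
- Result aggregation: Where does each thread's output go? How do you collect all the results?
- Error handling: An exception in one thread has to be caught and reported separately; there's no unified error handling.
- Credentials: Sharing credentials and connection settings safely across threads takes extra care.
- Scalability: Starting one thread per device (500+ threads, each holding an SSH session) can exhaust memory and file descriptors unless you cap it with a pool.
Nornir doesn't avoid threads: its default runner is a thread pool. What it gives you is the plumbing above, already built: a capped number of workers, one result per host, and per-host failure tracking.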
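A quick way to check the GIL point for yourself, as a sketch: eight simulated device round-trips (`time.sleep()` stands in for waiting on SSH) are timed in a plain loop and then on a `ThreadPoolExecutor` from the standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WAIT_S = 0.1  # stand-in for one device's network round-trip

def wait_on_device(host: str) -> str:
    # time.sleep() releases the GIL, just like a thread blocked on a socket read,
    # so other threads can run while this one waits
    time.sleep(WAIT_S)
    return host

hosts = [f"r{i}" for i in range(8)]

start = time.perf_counter()
sequential_results = [wait_on_device(h) for h in hosts]
sequential_s = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded_results = list(pool.map(wait_on_device, hosts))
threaded_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

The threaded run takes roughly one wait instead of eight. Speed was never the hard part; everything in the problem list above (collecting results, isolating failures, capping the thread count) is what you'd still have to write yourself.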
Nornir's Architecture
Nornir solves this problem by building a task-based framework instead of a script-based one.
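Core Concepts
1. Tasks (not loops)
Instead of writing a loop that handles every device, you write a task: a function that does the work for one host. Nornir then calls it for each host in the inventory.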
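The contrast in plain Python, as a sketch rather than Nornir's API: `backup_config` and the host names are hypothetical, and the built-in `map` stands in for a runner. (In real Nornir code a task is also an ordinary function; it receives a `Task` object and reads the current device from `task.host`.)

```python
# Loop style: the iteration and the per-device work are tangled together
def backup_everything(hosts: list[str]) -> dict[str, str]:
    results = {}
    for host in hosts:
        results[host] = f"hostname {host}\n!"  # pretend this came from the device
    return results

# Task style: a function that only knows how to handle ONE host
def backup_config(host: str) -> str:
    return f"hostname {host}\n!"  # pretend this came from the device

hosts = ["core1", "edge1", "edge2"]

print(backup_config("core1"))                       # easy to test on a single host
print(dict(zip(hosts, map(backup_config, hosts))))  # any runner can apply it to every host
```

Because the task never decides which hosts to visit or in what order, a runner is free to call it sequentially, on a thread pool, or on a filtered subset of the inventory.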
2. Inventory (not hardcoded or CSV)
Nornir abstracts device information into an inventory: the hosts, the groups they belong to, and shared defaults.
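A toy model of the idea, as a sketch and not Nornir's implementation: each host lists its groups, and any setting it doesn't define falls back to its groups (in order) and then to global defaults. Nornir's SimpleInventory layers host, group, and default data in a similar way, read from YAML files; the device names, groups, and fields here are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

DEFAULTS = {"username": "netops", "port": 22}
GROUPS = {
    "cisco_ios": {"platform": "ios"},
    "branch": {"site_type": "branch", "port": 2222},
}

@dataclass
class Host:
    name: str
    hostname: str
    groups: list[str] = field(default_factory=list)
    data: dict[str, Any] = field(default_factory=dict)

    def get(self, key: str) -> Optional[Any]:
        """Resolve a setting: the host itself, then its groups in order, then defaults."""
        if key in self.data:
            return self.data[key]
        for group in self.groups:
            if key in GROUPS.get(group, {}):
                return GROUPS[group][key]
        return DEFAULTS.get(key)

inventory = {
    "core1": Host("core1", "10.0.0.1", groups=["cisco_ios"]),
    "br-sw1": Host("br-sw1", "10.20.0.5", groups=["branch", "cisco_ios"],
                   data={"username": "branch-admin"}),
}

for host in inventory.values():
    print(host.name, host.hostname, host.get("platform"), host.get("username"), host.get("port"))
```

Shared settings live in one place, so changing the SSH port for every branch device is a one-line edit to the `branch` group rather than a search through scripts or spreadsheets.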
3. Runner (not manual iteration)
Nornir's runner automatically:
- Loads all devices from inventory
- Executes tasks in parallel
- Collects results
- Handles failures
4. Result Aggregation (not scattered output)
The benefit: Nornir handles all the parallel complexity for you. You focus on the business logic.
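A toy model of what the runner and result aggregation do, as a sketch in plain Python; it is not Nornir's code. `run()` fans a task out over a bounded thread pool, wraps each call so an exception is recorded against its host instead of stopping the run, and returns one result per host. The host names are hypothetical and `edge2` fails on purpose. In Nornir itself, the equivalent call is `nr.run(task=...)`, whose aggregated result can be inspected per host (for example through its `failed_hosts` attribute).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class HostResult:
    host: str
    result: Any = None
    exception: Optional[Exception] = None

    @property
    def failed(self) -> bool:
        return self.exception is not None

def run(task: Callable[[str], Any], hosts: list[str], num_workers: int = 20) -> dict[str, HostResult]:
    """Run task(host) for every host on a bounded pool; collect one HostResult per host."""
    def guarded(host: str) -> HostResult:
        try:
            return HostResult(host, result=task(host))
        except Exception as exc:  # a failing host is recorded, not fatal to the run
            return HostResult(host, exception=exc)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return {r.host: r for r in pool.map(guarded, hosts)}

def get_config(host: str) -> str:
    time.sleep(0.05)  # simulated network wait
    if host == "edge2":
        raise TimeoutError(f"{host}: SSH timeout")  # hypothetical failure for the demo
    return f"hostname {host}\n!"

results = run(get_config, ["core1", "edge1", "edge2", "edge3"], num_workers=2)
print("ok:", [h for h, r in results.items() if not r.failed])
print("failed:", [h for h, r in results.items() if r.failed])
```

`pool.map` returns results in input order, so the aggregated dictionary lists hosts in the same order as the inventory no matter which one finished first.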
Architecture Comparison
Tutorial #3 (Sequential Script Architecture)
Characteristics:
- Linear control flow
- One device at a time
- Results scattered (some in variables, some in files)
- Hard to reuse (tied to specific task logic)
Nornir (Task-Based Parallel Architecture)
Characteristics:
- Task-based (functional programming)
- Parallel by default
- Unified result object
- Highly reusable (tasks are libraries)
When to Use Nornir
Use Nornir When
- Scale matters (50+ devices)
- Performance matters (tight backup windows)
- Complexity exists (multi-step workflows, compliance checks)
- Teams collaborate (shared task libraries)
- Enterprise requirements (audit trails, integration, reliability)
- Future growth (will your network grow?)
Use Tutorial #3 When
- Quick one-off script
- Very small network (<10 devices)
- Learning automation basics (Tutorial #3 is perfect for this)
- No performance requirements
Detailed Comparison: Approaches to Multi-Device Automation
The table below breaks down how different approaches compare across real-world concerns:
| Aspect | Tutorial #3 (Sequential) | Threading (DIY) | Nornir (Framework) | Ansible (Alternative) |
|---|---|---|---|---|
| Learning curve | Easy | Moderate | Moderate | Moderate-Hard |
| Max devices | ~100 | ~50 | 500-5000+ | 1000+ |
| Runtime (100 devices) | 10 min | 2-3 min* | 1-2 min | 2-3 min |
| Code complexity | Low | High | Moderate | High |
| Error isolation | try/except per device | Thread local storage | Native (per-host) | Native (per-host) |
| Credential management | Hardcoded/env vars | Thread-safe needed | Secure pattern | Vault support |
| Team reusability | One-off scripts | Hard (threading logic) | Easy (task libraries) | Easy (playbooks) |
| Extensibility | Hard | Very hard | Easy | Easy |
| Logging | Messy in parallel | Interleaved output | Clean/unified | Clean/unified |
| Integration | Manual (APIs, DBs) | Manual | Plugin system | Module system |
| Production-ready | No | Rarely | Yes | Yes |
| Maintenance burden | Low initially, high later | Very high | Moderate | Moderate |
*Threading performance varies widely depending on how the thread pool is sized and how errors are handled.
Real-World Gotchas & Edge Cases
Gotcha #1: The 3 AM Production Outage
Scenario: Your sequential script has been running fine for 6 months. Your network grows 10x. Now backups that took 30 minutes take 5 hours.
The problem: You didn't anticipate scale early.
The lesson: Planning for scale isn't premature optimization; it's professional engineering.
Gotcha #2: The Failing Device That Kills Everything
Sequential script (unprotected): a single unhandled exception ends the whole loop.
Real scenario: Device 47 of 100 hits an SSH timeout. The exception aborts the script, so the backup never completes. Management asks, "Why weren't the other 53 devices backed up?"
Solution: Framework-level error isolation (Nornir handles this automatically)
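The difference in a sequential script, as a sketch with a hypothetical `backup()` that times out on device 47 of 100: the unprotected loop stops at the first exception, while a per-device `try/except` records the failure and keeps going. That per-host bookkeeping is what a framework does for you on every run.

```python
def backup(device_id: int) -> str:
    """Hypothetical backup call; device 47 times out."""
    if device_id == 47:
        raise TimeoutError(f"device {device_id}: SSH timeout")
    return f"config for device {device_id}"

devices = range(1, 101)

# Unprotected loop: the first exception ends the run.
# (The outer try only exists so this demo keeps going; in the real script nothing catches it.)
saved_before_abort = []
try:
    for d in devices:
        saved_before_abort.append(backup(d))
except TimeoutError as exc:
    print(f"aborted at {exc}; saved {len(saved_before_abort)} of {len(devices)}")

# Protected loop: record each failure and move on to the next device
# (real code would also catch the exceptions raised by your SSH library)
saved, failures = [], {}
for d in devices:
    try:
        saved.append(backup(d))
    except TimeoutError as exc:
        failures[d] = str(exc)
print(f"saved {len(saved)} of {len(devices)}; failed: {sorted(failures)}")
```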
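Gotcha #3: Credentials Leak Into Logs
Common mistake: logging the full connection settings, password included, while debugging.
In parallel environments, with many workers writing to the same log, this becomes even more visible. Nornir helps by keeping credentials in the inventory rather than in your task code, but you still need to keep them out of your own log messages.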
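A minimal sketch of the safer habit, assuming connection settings live in a dict like the hypothetical `device` below: log a redacted copy instead of the dict itself. The key names treated as secret are an assumption; adjust them to match your own inventory.

```python
import logging

SECRET_KEYS = {"password", "secret"}  # assumed names of sensitive fields

def redacted(settings: dict) -> dict:
    """Copy of the settings that is safe to log: secret values replaced with '***'."""
    return {k: ("***" if k in SECRET_KEYS else v) for k, v in settings.items()}

device = {"host": "10.0.0.1", "username": "netops", "password": "hunter2", "secret": "en4ble"}

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(message)s")
log = logging.getLogger("backup")

# Mistake: log.info("connecting with %s", device) would write the password to the log
log.info("connecting with %s", redacted(device))
```

`redacted()` returns a new dict, so the real credentials are still available to the code that actually opens the connection.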
Gotcha #4: Device Dependency Chains
Real scenario: Before backing up an access switch, you need to pull its inventory from your IPAM system.
Sequential: Can't start step 2 until step 1 completes (correct!)
Threading DIY: Race conditions if not careful
Nornir: Built-in patterns for this (covered in a later tutorial in this series)
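One way to keep a dependency in order while still running hosts concurrently, sketched in plain Python: put both steps in one per-host function, so step 2 for a host always follows step 1 for that host, while different hosts run side by side. `lookup_ipam()`, `backup()`, and the IPAM data are hypothetical stand-ins. In Nornir, a task can call other tasks with `task.run(...)`, and those subtasks run in order for that host.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

IPAM_DB = {"acc1": "10.1.0.11", "acc2": "10.1.0.12", "acc3": "10.1.0.13", "acc4": "10.1.0.14"}
events: list[tuple[str, str]] = []  # (host, step), in the order steps finish
events_lock = threading.Lock()

def lookup_ipam(host: str) -> str:
    time.sleep(0.02)  # simulated IPAM API call
    with events_lock:
        events.append((host, "ipam"))
    return IPAM_DB[host]

def backup(host: str, mgmt_ip: str) -> str:
    time.sleep(0.02)  # simulated SSH session
    with events_lock:
        events.append((host, "backup"))
    return f"{host} backed up via {mgmt_ip}"

def ipam_then_backup(host: str) -> str:
    mgmt_ip = lookup_ipam(host)   # step 1 finishes...
    return backup(host, mgmt_ip)  # ...before step 2 starts, for this host only

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(ipam_then_backup, IPAM_DB))

print(results)
```

All four IPAM lookups can overlap, but no host's backup starts before its own lookup has returned.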
Gotcha #5: Memory Exhaustion with Large Device Counts
Scenario: You parallelize all 5,000 devices at once.
What happens:
- 5,000 SSH connections × 4 MB per connection = 20 GB RAM
- Python runs out of memory and crashes
- Takes you 2 hours to figure out why
The fix: Cap concurrency with a bounded worker pool. In Nornir, that's the threaded runner's `num_workers` setting (for example, `num_workers: 50`; the default is 20).
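A sketch showing that a bounded pool really caps concurrency: 500 simulated sessions run through a `ThreadPoolExecutor` with `max_workers=50`, and a counter tracks how many are open at once. No real connections are opened. At the 4 MB per session assumed above, a cap of 50 keeps the footprint near 200 MB instead of 20 GB.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 50
open_now = 0
peak = 0
lock = threading.Lock()

def simulated_session(device_id: int) -> int:
    global open_now, peak
    with lock:                 # "connect"
        open_now += 1
        peak = max(peak, open_now)
    time.sleep(0.01)           # hold the session while waiting on I/O
    with lock:                 # "disconnect"
        open_now -= 1
    return device_id

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    done = list(pool.map(simulated_session, range(500)))

print(f"{len(done)} devices, peak concurrent sessions: {peak}")
```

The pool never runs more than `MAX_WORKERS` sessions at once; the other devices simply wait in the queue until a worker frees up.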
Practical Decision Tree
Use this to decide which approach is right for you now:
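The same guidance can be condensed into a small function, as a sketch. The thresholds (fewer than 20 devices, more than 50, expected growth, one-off jobs) come from the "When to Use Nornir" section and the questions at the end of this tutorial; `choose_approach` and its parameters are hypothetical.

```python
def choose_approach(devices: int, growth_expected: bool, one_off: bool = False) -> str:
    """This tutorial's rules of thumb as code; a starting point, not a policy."""
    if devices > 50 or growth_expected:
        return "Nornir"                       # scale or growth: the framework pays off
    if one_off or devices < 20:
        return "Tutorial #3 script"           # small, stable, or throwaway
    return "Either; start learning Nornir"    # 20-50 devices, no growth planned

for devices, growth in [(8, False), (35, False), (300, False), (15, True)]:
    print(f"{devices:>3} devices, growth={growth}: {choose_approach(devices, growth)}")
```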
You've Got Options, But They're Different
Honest truth: There's no "best" tool. There's the right tool for your current situation.
- Tutorial #3 is your "learn automation" tool
- Hand-rolled threading is your "avoid this" tool (Nornir already does the threading for you; if you want to know why DIY threading is risky, see our deep-dive: Threading in Network Automation: When to Use It and When to Avoid It)
- Nornir is your "production ready" tool
- Ansible is your "infrastructure as code" tool
They're solving different problems at different scales. Nornir solves this problem (parallel network device operations) extremely well.
Interactive Learning Checkpoint
Before moving on, ask yourself:
- Do you understand why loops alone won't work for many devices?
  - If no: Re-read "The Problem: Sequential Bottleneck"
  - If yes: Move forward
- Can you explain parallel execution to someone?
  - If no: Study the Mermaid diagrams and ASCII art above
  - If yes: Move forward
- Do you know when you'd use Nornir vs. Tutorial #3?
  - If no: Review the "When to Use Nornir" section
  - If yes: You're ready for Tutorial #2
Stuck? This is the moment where the concepts should click. Take 10 minutes and re-read any section that confused you. This foundation matters for everything that comes next.
The Production Reality
In real organisations, here's what happens:
Month 1: "Let's automate config backups!"
→ Build Tutorial #3 script
→ Works great!
Month 3: "We added offices in Asia and Europe. Backups now take 90 minutes."
→ "Hmm, let me add threading..."
→ Threads cause issues...
Month 6: "Can we also do compliance checking? And integrate with our ticketing system?"
→ "The script is spiraling... This needs a redesign..."
→ This is where you wish you'd started with Nornir
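Under the Hood: Why Nornir Works
Nornir's default runner executes your task for each host on a thread pool (Python's `concurrent.futures.ThreadPoolExecutor`), with at most `num_workers` hosts in progress at once. A thread waiting for a device's SSH response releases the GIL, so the other threads keep working in the meantime.
Nornir abstracts this complexity, so you write simple task functions and Nornir handles the concurrent execution automatically.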
Real Enterprise Example
Telecom company with 2,500 Cisco devices
Old approach (Tutorial #3): a full backup run took 2+ hours.
With Nornir: the same run took about 20 minutes.
The business value: 20 minutes of automation runtime instead of 2+ hours means a shorter backup window and real cost savings.
The Learning Curve
Truth: Nornir IS more complex than Tutorial #3.
But that complexity serves a purpose: inventory, concurrent execution, and result handling are built in rather than bolted on later.
The cost/benefit: Adding moderate complexity early saves enormous complexity later (no threading hacks, no refactoring).
What's Coming Next
In Tutorial #2: Nornir Fundamentals, we'll:
- Install Nornir and dependencies
- Create your first inventory file
- Write your first task function
- Run it against 5+ devices in parallel
- See the performance benefit firsthand
Spoiler: You'll write basically the same logic as Tutorial #3, but Nornir will parallelize it automatically.
Key Takeaway
If you're automating networks at any significant scale:
Sequential scripts = Training wheels
Nornir = Real enterprise tool
You don't need to choose immediately. But if you're building anything more than a quick proof-of-concept, learning Nornir is an investment that pays dividends.
Your Perspective
As someone building this for the first time, here's my honest take:
- Nornir feels more complex when you first see it (it is)
- BUT it's designed specifically for your problem (parallel network ops)
- AND the payoff is huge (10-20x faster)
- AND once you understand it, it becomes your default tool
Before You Continue
Make sure you have:
- Completed all Beginner Tutorials
- Successfully run Tutorial #3 on at least 5 devices
- Observed how long it takes (30+ min for many devices)
- Understood the sequential bottleneck
When you're ready, Tutorial #2 will teach you to solve this problem with Nornir.
Questions Before Moving On?
"Do I really need Nornir?"
- If you have <20 devices and no growth expected: Probably not; Tutorial #3 is enough
- If you have >50 devices or growth expected: Absolutely yes
- If you're in between: Consider learning it so you're prepared as you scale
"Will I still use the Tutorial #3 approach?"
- Yes, for quick/one-off scripts
- But for anything you'll run more than once or scale: Nornir
"Is Nornir hard to learn?"
- Moderate difficulty (Tutorial #2 makes it accessible)
- But the concepts are universal (concurrent I/O, task-based automation)
- Worth the investment
Continue to Tutorial #2: Nornir Fundamentals →