NOC Intelligence

When 40% of Work is Invisible.

Role: Product Designer & User Researcher
Year: 2019 – 2022
Domain: Enterprise · B2B
Nokia
Extreme Networks
VMware
Telefónica
ExtremeCloud IQ interface
01 TL;DR

Transforming network operations from reactive crisis management to proactive failure prevention.

Network monitoring tools captured every metric. But they missed 40–50% of operators' work: manually correlating data across 5 fragmented systems.

10+ min gathering data before diagnosis begins
5+ tools requiring constant context switching
0% proactive detection: everything reactive
$920M–$1.3B industry-wide annual waste on invisible work

Unified correlation

System connects what operators did manually

Predictive warnings

20–30 min advance notice before failures

NOW → WARNING → FAILURE

Adaptive guidance

Matches support to operator expertise

NOVICE EXPERT
15× efficiency: data gathering 10+ min → under 1 min (Nokia, 32M subscribers)
2× faster diagnosis: manual correlation eliminated (Extreme Networks)
32M subscribers protected: failures prevented 20–30 min before customers notice
02 My Role

UX Designer & Researcher

Collaborated with Project Manager, Interaction Designer, and Visual Designer on multidisciplinary teams at UXReactor.

I redesigned network operations tools for Nokia, Extreme Networks, and VMware. Observed 24 operators and discovered they spent 40–50% of their time on invisible work: manually correlating data across fragmented systems. Designed a unified workspace that automates correlation, preserves context, and adapts to operator expertise. Validated with operators monitoring 32M live subscribers.

Impact: 15× efficiency improvement, 2× faster diagnosis, proactive prevention replacing reactive firefighting.

Workshop session at a whiteboard
03 Big Picture

The industry wastes nearly $1 billion annually on work that shouldn't exist.

$6.14B: Serviceable Addressable Market (NOC services market)
$11.5B: Total Addressable Market (IT operations management market)
$920M–$1.3B: The Hidden Cost (wasted annually on manual correlation work)
Illustration of people working across different spaces

Efficiency improvement matters because the problem scales across an entire industry. In network operations centers (NOCs), where operators monitor and maintain critical infrastructure for telecom providers and large enterprises, much of the work happens across fragmented tools and systems. The inefficiency I observed (operators spending 40–50% of their time on invisible work) affects 50,000+ network operators globally. These teams are responsible for keeping always-on infrastructure running, supporting hospitals, banks, telecoms, governments, and manufacturers that rely on 24/7/365 uptime.

Additionally, 5G rollout multiplies monitoring complexity by 3–5×, while IoT introduces millions of additional endpoints. Reactive-only systems don't scale in this environment. A unified, predictive approach enables next-generation infrastructure.

Clients knew something was broken.
But their understanding was surface-level…

"By the time we get the data, the customer's already called in angry."

"Our juniors are checking every single metric. They need better training and clearer alerts."

"Five systems, manual correlation each time. We need consolidation but don't know where to start."

04 Research

Clients saw symptoms. We suspected root causes.

Shadowed Live Incidents: 24 operators, 1 hour each, during actual failures
Tracked Two Weeks of Work: 12 operators logged 247 incidents, recording every tool, every switch, every minute
Compared Expert vs. Novice: 8 seniors vs. 8 juniors, same scenarios, measured what they checked and why
Incident timeline: 9:47 AM, incident alert · 10:02 AM, escalation · 2:34 AM, night shift · 5:58 AM, shift handoff
"5 tools open. Still can't find where the fault actually is."
"I checked topology, alarms, tickets; still missing customer impact data."
"Which alarms? Which customer? Connected or separate issues? No context in the ticket."
"Three incidents. Notes in the ticket. All resolved, I think."
"Which tools? Which customers? Is it still ongoing?"
TOOL #1 → #2 → #3 → STILL NO ANSWER
CONTEXT LOST IN TRANSLATION

What we learned.

40–50% Invisible Integration Work

Systems generated data. Humans performed integration. Every incident: operators manually correlated alarm IDs → device locations → performance metrics → customer impact across 5 disconnected tools. 10+ minutes, 18–23 context switches before diagnosis even began.

Star & Strauss (1999): "Work that keeps things running but never shows up in system dashboards." That coordination labor? Never measured. Never improved.

Context Died with Every Interruption

NOC work isn't linear. New alerts fire. Shifts change. L1 escalates to L2. Every interruption destroyed context. L2 spent 5–10 minutes rebuilding mental models, because that context existed only in L1's head and in fragments scattered across 5 tool histories.

Miller (1956): Working memory holds 7±2 items. Operators juggled 20+ disconnected data points. Systems preserved data. They didn't preserve understanding.

Zero Proactive Capability

Systems alerted AFTER thresholds breached. Never before. 0% early detection. By the time alerts fired, customers were already affected. Every incident became a crisis because systems provided no temporal visibility into emerging patterns.

"Everything is on fire and needs fixing 20 minutes ago." (Reddit operator, r/sysadmin)

Juniors Weren't Undertrained; They Were Overwhelmed

Seniors: 5–7 metrics checked, 100% accuracy. Juniors: 25+ metrics checked, 37% accuracy. Juniors weren't checking too few; they were checking too many. Systems showed everything, prioritized nothing.

Chi, Feltovich & Glaser (1981): Experts use pattern recognition and selective attention. Novices search exhaustively. Not a training problem. A system design problem.

Reddit post about constant task switching r/sysadmin
Comment about alert escalation and monitoring noise koloth44
Comment about constant brain-shifting through the day jmnugent

Focus Area

TOPOLOGY · ALARMS · PERF. · TICKETS · CUSTOMER: 18–23 switches per incident

Data Fragmentation

Operators became the integration layer, switching between 5 disconnected tools 18–23 times per incident. Context destroyed with every switch. Every escalation meant rebuilding understanding from scratch.

HMW preserve context across tools, interruptions, and escalations?

TOPOLOGY · ALARMS · PERF. · TICKETS: 40–50% capacity

Manual Correlation

No shared data model or correlation engine. Operators manually mapped relationships with notepads, spreadsheets, and memory. 40–50% of operator capacity spent on integration labor that never appeared in metrics.

HMW make the system perform the correlation operators did manually?

THRESHOLD BREACH → ALERT FIRES → IMPACT: 0% proactive detection

Reactive-Only Architecture

Threshold-based alerting reacts after breach: no pattern recognition, no temporal analysis of degradation trends. 0% proactive detection. Preventable outages reached customers first.

HMW enable operators to prevent failures, not just respond?

25+ METRICS, 37% accuracy · 5 METRICS, 100% accuracy

Undifferentiated Tool Design

Same overwhelming interface for novices and experts. Juniors checked 25+ metrics with 37% accuracy. Seniors checked 5–7 with 100% accuracy, manually filtering noise the system should hide.

HMW match system intelligence to operator expertise?

05 Solution

Solutioning Framework

Pattern Recognition

System surfaces relationships operators currently discover manually. Correlation becomes a system capability, not a human burden.

TOPOLOGY · ALARMS · PERF. · TICKETS · CUSTOMER → SYSTEM correlates: correlation → system capability
Proactive Capability

Temporal visibility into emerging patterns enables prevention. Operators act before thresholds breach, not after.

PERF. over TIME vs. THRESHOLD: trend detected that would breach, 20–30 min advance warning
Adaptive Intelligence

System support matches operator expertise. Juniors get structured guidance. Seniors get noise filtered out.

JUNIOR: 1 · 2 · 3 structured guidance · SENIOR: noise filtered out
Context Preservation

Mental models must survive interruptions, tool switches, and escalations. Spatial layout preserves what working memory can't hold.

TASK A → INTERRUPTED → RESUMED: context preserved (working memory → spatial layout)

Four problems identified. Four different solutions needed.

We didn't jump to wireframes. Surface-level consolidation ("one workspace") wouldn't address invisible work, reactive architecture, or expertise mismatch. We needed a framework that solved each problem deliberately.

First, we mapped ideal workflows for each operator tier (L1, L2, L3). Not what they currently did, but what they SHOULD be able to do if systems supported them properly. What would diagnosis look like if context never died? If correlation happened automatically? If operators could prevent instead of react?

These ideal-state workflows revealed the four design principles that became our framework: Pattern Recognition, Proactive Capability, Adaptive Intelligence, and Context Preservation.

Validation Approach

Workflow mapping board · Concept 1: Location Intelligence with Lens · GetMap validation screen · GetMap service drill-down screen
  1. Paper sketches tested with 12 operators during live monitoring shifts.
  2. Mid-fidelity functional prototypes tested under real operational pressure.
  3. Refinement based on what worked during actual incidents, not controlled scenarios.

Fitts's Law Violation
Distance = Time = Cognitive Cost

Five separate tools created maximum distance between related data. Every tool switch added physical distance (mouse travel), temporal distance (window switching), and cognitive distance (mental model rebuilding).

Framework diagram

Automated Correlation

System automatically correlates data that operators previously connected manually. Cognitive burden shifts from human memory to system capability. Operators are freed from integration labor to focus on diagnosis.

Gestalt Principle
Proximity Creates Relationship

Unified workspace places related data in visual proximity. Alarm ID, affected devices, and customer impact appear together. Operators see connections through spatial layout, not memory. Tool 1's alarm now visibly links to Tool 5's customer impact.

Solution 1: Context That Survives Interruptions

Problem: Data Fragmentation (18–23 switches, 10+ min gathering data) · HMW: Preserve context across tools, interruptions, escalations? · Principle: Context Preservation

The Solution: Unified workspace with persistent spatial layout. Timeline left, correlation center, customer impact right. Operators navigate within workspace, not between tools.
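
To make the idea of context surviving an interruption concrete, here is a minimal sketch, not the shipped implementation. It assumes the investigation state can be captured as a small record and handed from L1 to L2 as JSON. The `InvestigationContext` class, its field names, and the `save`/`restore` helpers are all hypothetical, chosen only to mirror the kinds of context described above (time window, selected device, alarms, notes).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InvestigationContext:
    """Hypothetical snapshot of what an operator was looking at."""
    incident_id: str
    selected_device: str
    time_window_minutes: int
    alarm_ids: list[str]
    notes: list[str]


def save(ctx: InvestigationContext) -> str:
    """Serialize the context so it can outlive a shift change or escalation."""
    return json.dumps(asdict(ctx))


def restore(payload: str) -> InvestigationContext:
    """Rebuild the same context for whoever picks the incident up."""
    return InvestigationContext(**json.loads(payload))


# Illustrative values only.
l1_context = InvestigationContext(
    incident_id="INC-001",
    selected_device="router-7",
    time_window_minutes=30,
    alarm_ids=["ALM-1", "ALM-2"],
    notes=["Packet loss started after config push"],
)
l2_context = restore(save(l1_context))
```

In this sketch L2 resumes from the same device, time window, and notes that L1 left, instead of rebuilding them from five tool histories.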

Unified workspace with persistent context
Temporal context persisted across investigation
Spatial context preserved: network topology always visible
Summary context (41 alarms, 25 events) always present
Device details in same workspace (no tool switch to performance system)

Working Memory Overload

Miller (1956) showed humans max out at 7±2 items. Operators juggled 20+ across tools. Workspace externalizes what exceeded cognitive capacity.

Nielsen #6: Recognition Over Recall

Operators see relationships instead of remembering them between tools.

Solution 2: System Does the Correlation

Problem: Manual Correlation (40–50% time on invisible work) · HMW: Automate integration operators performed manually? · Principle: Pattern Recognition

The Solution: Auto-linked correlation engine connects alarm IDs → affected devices → customer impact → performance metrics. System performs integration operators did with notepads.
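
The chain alarm → device → customer impact → performance metrics can be illustrated with a small sketch. It is an assumption-laden toy, not the production correlation engine: the record shapes, the `CorrelatedIncident` class, and the `correlate` function are hypothetical, and the lookup tables stand in for the separate tools operators used to switch between.

```python
from dataclasses import dataclass


@dataclass
class CorrelatedIncident:
    """One alarm joined with everything an operator used to look up by hand."""
    alarm_id: str
    device: str
    customers: list[str]
    metrics: dict[str, float]


def correlate(alarms, customers_by_device, metrics_by_device):
    """Join each alarm to its device, affected customers, and device metrics."""
    incidents = []
    for alarm in alarms:
        device = alarm["device"]
        incidents.append(
            CorrelatedIncident(
                alarm_id=alarm["id"],
                device=device,
                # Missing entries become empty results instead of errors.
                customers=sorted(customers_by_device.get(device, [])),
                metrics=dict(metrics_by_device.get(device, {})),
            )
        )
    return incidents


# Illustrative data standing in for the alarm, customer, and performance tools.
alarms = [
    {"id": "ALM-1", "device": "router-7"},
    {"id": "ALM-2", "device": "switch-3"},
]
customers_by_device = {"router-7": ["globex", "acme"]}
metrics_by_device = {"router-7": {"packet_loss_pct": 2.5}}

incidents = correlate(alarms, customers_by_device, metrics_by_device)
```

Each result carries the alarm, the device, the affected customers, and the metrics together, which is the join operators previously did with notepads.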

Correlated workspace: linked incidents
Eliminates manual timeline matching
Auto-linked incidents in the same workspace
System correlates events automatically
Performance data linked to device

Distributed Cognition

Hutchins (1995): Cognitive work distributed to system. Operators interpret integrated data instead of integrating scattered data.

Invisible Work Made Visible

Star & Strauss (1999): Coordination labor "invisible to rationalized models." System performs this automatically. 40–50% of capacity redirected to diagnosis.

Correlation engine: wide view
Alarm to device correlation
Device to customer correlation

Solution 3: See Failures Before They Happen

Problem: Reactive-Only Architecture (0% proactive detection) · HMW: Enable prevention, not just response? · Principle: Proactive Capability

The Solution: Predictive timeline with 20–30 min advance warning. Statistical anomaly detection surfaces early degradation patterns before threshold breach.
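
One simple way to picture advance warning is to extrapolate a metric's recent trend and estimate when it would cross its threshold. The sketch below uses a plain least-squares line, which is a stand-in I chose for illustration and not the anomaly detection method used in the product. The function name `minutes_until_breach`, the sample data, and the assumption that higher values are worse are all hypothetical.

```python
def minutes_until_breach(samples, threshold):
    """Estimate minutes from the last sample until a rising metric hits threshold.

    samples: list of (minute, value) pairs, oldest first.
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or improving: nothing to warn about
    intercept = mean_v - slope * mean_t
    breach_minute = (threshold - intercept) / slope
    last_minute = samples[-1][0]
    return max(0.0, breach_minute - last_minute)


# Illustrative series: utilization climbing one point per minute.
samples = [(0, 50.0), (5, 55.0), (10, 60.0), (15, 65.0)]
eta = minutes_until_breach(samples, threshold=90.0)
```

With these illustrative numbers the trend reaches the threshold 25 minutes after the last sample, which falls inside the 20–30 minute window the timeline aims for. A real system would also need a confidence estimate before showing the warning.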

Predictive timeline with confidence indicators

20–30 min advance warning enables intervention

Predictive visibility into future quality degradation

"Actual" vs "Forecast" split timeline

Confidence indicators show prediction reliability

Temporal Visibility

Suchman (1987): Can't prevent what you can't foresee. Predictive layer provides temporal visibility systems lacked.

Trust Through Transparency

Lee & See (2004): Confidence indicators prevent automation bias. Operators know when to trust predictions.

Solution 4: Guidance That Matches Expertise

Problem: Undifferentiated Tool Design (juniors 37%, seniors 100%) · HMW: Match system intelligence to operator skill? · Principle: Adaptive Intelligence

The Solution: Contextual guidance adapts to expertise. Juniors see highlighted metrics with "check this first" priority. Seniors see minimal interface with pre-filtered noise.
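
The split between junior and senior views can be sketched as one ranked list rendered two ways. This is a simplified illustration under stated assumptions, not the actual guidance logic: the `guided_view` function is hypothetical, the metric ranking is assumed to come from elsewhere, and the limits of 9 and 7 are borrowed from the metric counts reported in this case study.

```python
JUNIOR_LIMIT = 9  # assumption: upper end of the 7–9 metrics juniors check after the redesign
SENIOR_LIMIT = 7  # assumption: upper end of the 5–7 metrics seniors check


def guided_view(level, ranked_metrics):
    """Return the metrics to show, given a list ranked most-diagnostic first."""
    if level == "junior":
        view = []
        for step, name in enumerate(ranked_metrics[:JUNIOR_LIMIT], start=1):
            label = f"{step}. {name}"
            if step == 1:
                label += " (check this first)"
            view.append(label)
        return view
    # Seniors get the short, unannotated list with the noise already removed.
    return list(ranked_metrics[:SENIOR_LIMIT])


# Illustrative ranking of 12 metrics.
ranked = [f"metric_{i}" for i in range(1, 13)]
junior_view = guided_view("junior", ranked)
senior_view = guided_view("senior", ranked)
```

Both roles draw on the same ranking; only the amount of scaffolding around it changes.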

Adaptive guidance scene

Adaptive perspective switching: juniors see grouped summaries

Contextual filtering reduces noise for focused investigation

System aggregates individual problems into geographic patterns, eliminating manual region-by-region checking

Timeline Synchronized with Map View: the framework combines location, quality scores, metrics, temporal data, and incidents in a single correlated view, so operators see geographic, technical, and customer impact patterns without mental integration across tools

Scaffolding for Novices

Lave & Wenger (1991): Legitimate peripheral participation. Juniors need structured access to expert practice.

Expert Selective Attention

Chi et al. (1981): Experts use pattern recognition, filter noise. Pre-filtered interface respects expert chunking patterns (5–7 metrics, not 25+).

Four solutions working as a unified system.
Context preserved, correlation automated, failures predicted, expertise supported.

Visual Design System

Color palette · Typography hierarchy · Typography system

Dark NOC environments require high-contrast interfaces. Blue primary colors preserve night vision. Noto Sans provides character distinction at small sizes (0/O, 1/l/I). Color-blocked sections (Health Timeline purple, Alarms blue, Map orange) create spatial organization operators recognize in peripheral vision across 3–5 monitors.

Operators work 8–12 hour shifts under fatigue. Every design choice (color contrast, font size, spacing) accounts for physical environment and operational stress.

06 So what

Nokia Experience allows us to proactively monitor our customers' experiences and take the actions needed, based on measured trends, and done through automation.

Brendon O'Reilly

Operator Level: From Firefighting to Flow

Before → After
10+ minutes gathering data across 5 tools → <1 minute in a unified workspace, context preserved
Manual correlation, 40–50% of time wasted → Automated correlation, 2× faster diagnosis (Extreme Networks)
2,000 experiences monitored per operator → 30,000 experiences monitored per operator (Nokia)
0% proactive detection, reactive firefighting → 40%+ early detection, preventive intervention
Juniors: 25+ metrics, 37% accuracy, drowning in data → Juniors: 7–9 metrics, guided attention, structured learning

Business Level: From Reactive Cost to Strategic Value

Before → After
Preventable outages reaching customers, SLA penalties → 100% customer growth rate (Nokia, 32M subscribers)