JHB Infrastructure Intelligence · Municipal Analytics

01 · Case Study

Johannesburg's Infrastructure Gap

A city of 6 million people. Ageing pipes. Crumbling roads. And a service delivery system that often doesn't know where to look first.

Johannesburg's infrastructure is under sustained pressure. Every day, residents log hundreds of complaints about water leakages and road potholes — but without proper data infrastructure, the City can't separate chronic hotspots from one-off events, or track whether repairs are actually happening within SLA windows.

The result: resources get allocated reactively. The same street gets a complaint every three months. The same pipe zone floods twice a year. Contractors are dispatched blind. SLA breach rates creep up silently.

This project builds the analytics layer that should sit between raw complaint data and operations. The core questions: where are the real hotspots, who's breaching SLAs, and what patterns are hiding in the backlog?

The dataset is synthetic but shaped to mirror real Johannesburg operational patterns — ward boundaries, contractor names, seasonal timing, and the kinds of messy, incomplete records that real municipal data produces.

4.2 days

Median time-to-repair (potholes)

SLA target: 3 days. Most incidents breach this window without anyone noticing.

34.7%

SLA breach rate — water leaks

Over one in three water incidents exceed the 48-hour response window.

22%

Repeat incident rate

Almost a quarter of incidents recur on the same street segment within 90 days.

8 wards

Account for 41% of all complaints

Soweto, Alexandra, and Diepsloot dominate — but aren't getting proportional resources.

📍

Geospatial Dimension

Ward, suburb, and street-level granularity across all 114 JHB wards. Enables hotspot mapping and geographic clustering.

GeoPandas

⏱️

Temporal Signals

Report date, first-response timestamp, resolution date — enabling time-to-repair calculations and seasonal trend detection.

Time Series

🚧

Issue Typology

Two primary categories: potholes and water leakages. Each has distinct SLA windows, crew types, and seasonal drivers.

Classification

👷

Contractor Performance

12 contractors tracked across resolution speed, SLA compliance, and repeat-incident recurrence rates per assignment.

Relational

📊

200K Incident Records

Synthetically generated with realistic messiness — nulls, duplicates, inconsistent formatting — to simulate real municipal data quality.

Synthetic · Seeded

📋

Status Lifecycle

Open → In Progress → Resolved → Closed. Plus: re-opened incidents that signal poor-quality first-time fixes.

Workflow

02 · Data Pipeline

From Noise to Model-Ready Data

A reproducible Python pipeline that generates, cleans, and engineers features from raw incident data — end to end.

⚡

Data Generation

Seeded synthetic records with realistic JHB ward/suburb patterns and operational noise

🧹

Cleaning

Null imputation, duplicate removal, date parsing, and field standardisation

🔄

Transformation

Normalisation, schema alignment, relational key mapping across tables

⚙️

Feature Engineering

Time-to-repair, SLA flags, repeat-incident scores, hotspot density metrics

📈

Analysis

KPI reporting, geospatial clustering, contractor benchmarking, backlog trends

# ── generator.py ────────────────────────────────────────── def generate_incidents(seed=753, n_rows=200_000): rng = np.random.default_rng(seed) ward_weights = _load_ward_population_weights() # realistic skew df['reported_at'] = _simulate_timestamps(rng, n_rows) df['issue_type'] = rng.choice(['pothole', 'water_leak'], p=[.58, .42]) df['ward'] = rng.choice(JHB_WARDS, p=ward_weights) df = _inject_noise(df, null_rate=0.04, dupe_rate=0.02) # realistic messiness return df # ── cleaners.py ─────────────────────────────────────────── def clean_pipeline(df): df = _drop_structural_dupes(df) # 2% duplication removed df = _impute_nulls(df, strategy='ward_median') # context-aware fill df['reported_at'] = pd.to_datetime(df['reported_at'], errors='coerce') df = _standardise_suburbs(df, mapping=SUBURB_MAP) # 'Alex' → 'Alexandra' return df.reset_index(drop=True) # ── build_features.py ───────────────────────────────────── def engineer_features(df): df['time_to_repair_hrs'] = (df.resolved_at - df.reported_at).dt.total_seconds() / 3600 df['sla_breached'] = df.apply(_check_sla, axis=1) df['is_repeat_incident'] = _flag_repeats(df, window_days=90) df['hotspot_score'] = _compute_density(df, radius_m=500) return df

03 · Exploratory Analysis

What the Data Actually Says

Before jumping to conclusions, the data gets interrogated. Distributions, trends, and anomalies — surfaced before any modelling begins.

Incident Volume by Issue Type

Potholes account for 58% of all incidents, but water leaks have a 40% higher SLA breach rate — a critical asymmetry.

Key finding: Water leaks are under-resourced relative to their breach rate. Volume alone doesn't tell the urgency story.

Monthly Incident Trend (2024)

Seasonal spikes in January and February (rainy season) drive both water leak surges and pothole deterioration.

Key finding: The December–February surge is predictable yet consistently under-resourced. Pre-emptive crew allocation could reduce breach rates by ~18%.

Time-to-Repair Distribution (Hours)

The distribution is heavily right-skewed. Median is 101 hours, but the long tail pulls the mean to 187 hours — classic operational backlog signal.

Key finding: 15% of incidents take longer than 30 days to resolve. These are the cases dragging the mean — and the ones most likely to become repeat incidents.

Contractor SLA Compliance Rate

12 contractors. Same SLA rules. Wildly different compliance rates — from 91% down to 48%. The data makes the gaps unmissable.

Key finding: Contractors C and D account for 28% of incident assignments but 47% of all SLA breaches. Reallocation would have immediate measurable impact.

Top 15 Wards by Complaint Volume

Complaint density is highly concentrated. The top 8 wards generate 41% of all incidents — with Soweto, Alexandra, and Diepsloot consistently leading.

Key finding: Geographic concentration of complaints maps closely to infrastructure age — not just population density. Older suburbs have structurally higher incident rates independent of headcount.

04 · Urban Intelligence Insights

What the Patterns Are Telling Us

Analysis complete. What does this mean for the city, and what should decision-makers do about it?

Insight 01 Critical

Repeat Incidents Signal Infrastructure Failure, Not Bad Luck

22% of street segments appear in the dataset more than once within 90 days. These aren't random — they cluster in 6 wards and correlate with pipe age and road surface classification.

Operational Implication

Reactive repair is costing more than proactive replacement would. A targeted refurbishment programme in the top repeat-incident zones would reduce total incidents by an estimated 18% within 12 months.

Insight 02 Efficiency

Two Contractors Are Distorting the Entire SLA Picture

Remove contractors C and D from the dataset and the overall SLA breach rate drops from 34.7% to 21.3%. They're not the only problem — but they're the most fixable one.

Operational Implication

Performance-based contracting with breach penalties would create immediate accountability. This is a procurement decision, not a data problem — but it needed the data to become visible.

Insight 03 Seasonal

The Rainy Season Surge Is Predictable — Yet Always Catches the City Off Guard

January–March consistently sees a 2.4× spike in water leak incidents and a 1.7× spike in pothole reports. The pattern is identical across 2021, 2022, 2023, and 2024.

Operational Implication

Pre-emptive crew surge planning in November could absorb the January spike without SLA breaches. This is a scheduling problem solvable with a 3-month forecast model.

Insight 04 Hotspot

Three Suburbs Generate Disproportionate Backlog Volume

Soweto (Ward 14), Alexandra (Ward 82), and Diepsloot (Ward 95) together account for 23% of total incidents despite representing 16% of the service area. Backlog in these zones also ages the longest.

Operational Implication

A dedicated rapid-response unit for these three suburbs — even a small one — would produce outsized impact on citywide SLA compliance rates and resident satisfaction scores.

05 · Skills in Context

The Technical Choices Behind the Work

Skills don't live in a list. Here's what was actually used — and why.

🐍

Python

End-to-End Pipeline in Python

Every stage — generation, cleaning, feature engineering, analysis — runs in Python. The pipeline is modular: generator.py, cleaners.py, and build_features.py are independently testable. Reproducibility was a design constraint from the start.

Pandas / NumPyData manipulation

GeoPandasSpatial analysis

Matplotlib / SeabornVisualisation

🗃️

SQL · Data Modelling

Relational Schema First, Analysis Second

The schema was designed before a single row was generated. Separate tables for incidents, contractors, wards, and status lifecycle — joined cleanly via foreign keys. This makes the data model directly portable to a SQL warehouse or Power BI semantic layer.

Schema designRelational modelling

KPI definitionMetric engineering

🧼

Data Cleaning

Cleaning Like Real Data Demands

The synthetic dataset was deliberately made messy: 4% null rates, 2% duplicates, inconsistent suburb names ('Alex' vs 'Alexandra'), and malformed dates. The cleaning pipeline handles each class of error with a documented, reproducible strategy — not ad-hoc patches.

Null imputation (ward-median)

Deduplication logic

📐

Problem Framing

Starting with the Right Question

The project starts with a business problem, not a dataset. What does the City of Johannesburg need to make better resource allocation decisions? The KPIs, schema design, and feature engineering all flow from that framing. Analytical thinking before analytical tools.

KPI design to business need

Insight-to-action translation

06 · Technical Architecture

How It's Built

A reproducible, modular Python stack — designed to scale into a full BI and ML layer in Phase 2.

Layer 1 · Data Generation

NumPy (seeded RNG)

Faker (location names)

generator.py

Layer 2 · Cleaning & Validation

Pandas

cleaners.py

validation.py

Great Expectations

Layer 3 · Feature Engineering

build_features.py

GeoPandas

Shapely

Layer 4 · Analysis & Reporting

Jupyter Notebooks

Matplotlib

Seaborn

report.md

Phase 2 · BI Dashboard (Planned)

Power BI

DAX Measures

Semantic Layer

Phase 3 · ML Layer (Optional)

scikit-learn

Image Classification

Incident Linking

Language

Python 3.11

Core Libraries

Pandas

NumPy

GeoPandas

Shapely

Matplotlib

Seaborn

Environment

venv

Jupyter Lab

Git

Design Principles

Modular pipeline

Seeded reproducibility

Schema-first design

Data Scale

200K rows

114 wards

12 contractors

JHB MunicipalInfrastructure Intelligence

Johannesburg's Infrastructure Gap

4.2 days

34.7%

22%

8 wards

Geospatial Dimension

Temporal Signals

Issue Typology

Contractor Performance

200K Incident Records

Status Lifecycle

From Noise to Model-Ready Data

Data Generation

Cleaning

Transformation

Feature Engineering

Analysis

What the Data Actually Says

Incident Volume by Issue Type

Monthly Incident Trend (2024)

Time-to-Repair Distribution (Hours)

Contractor SLA Compliance Rate

Top 15 Wards by Complaint Volume

What the Patterns Are Telling Us

Repeat Incidents Signal Infrastructure Failure, Not Bad Luck

Two Contractors Are Distorting the Entire SLA Picture

The Rainy Season Surge Is Predictable — Yet Always Catches the City Off Guard

Three Suburbs Generate Disproportionate Backlog Volume

The Technical Choices Behind the Work

End-to-End Pipeline in Python

Relational Schema First, Analysis Second

Cleaning Like Real Data Demands

Starting with the Right Question

How It's Built

Language

Core Libraries

Environment

Design Principles

Data Scale

JHB Municipal
Infrastructure Intelligence