Translating municipal incident data into operational decisions — tracking pothole and water leak patterns across Johannesburg wards to surface SLA breaches, repeat-incident hotspots, and contractor accountability gaps.
A city of 6 million people. Ageing pipes. Crumbling roads. And a service delivery system that often doesn't know where to look first.
Johannesburg's infrastructure is under sustained pressure. Every day, residents log hundreds of complaints about water leakages and road potholes — but without proper data infrastructure, the City can't separate chronic hotspots from one-off events, or track whether repairs are actually happening within SLA windows.
The result: resources get allocated reactively. The same street gets a complaint every three months. The same pipe zone floods twice a year. Contractors are dispatched blind. SLA breach rates creep up silently.
This project builds the analytics layer that should sit between raw complaint data and operations. The core questions: where are the real hotspots, who's breaching SLAs, and what patterns are hiding in the backlog?
The dataset is synthetic but shaped to mirror real Johannesburg operational patterns — ward boundaries, contractor names, seasonal timing, and the kinds of messy, incomplete records that real municipal data produces.
Ward, suburb, and street-level granularity across all 114 JHB wards. Enables hotspot mapping and geographic clustering. [GeoPandas]
Report date, first-response timestamp, and resolution date, enabling time-to-repair calculations and seasonal trend detection. [Time Series]
Two primary categories: potholes and water leakages. Each has distinct SLA windows, crew types, and seasonal drivers. [Classification]
12 contractors tracked across resolution speed, SLA compliance, and repeat-incident recurrence rates per assignment. [Relational]
Synthetically generated with realistic messiness (nulls, duplicates, inconsistent formatting) to simulate real municipal data quality. [Synthetic · Seeded]
Open → In Progress → Resolved → Closed. Plus: re-opened incidents that signal poor-quality first-time fixes. [Workflow]
A reproducible Python pipeline that generates, cleans, and engineers features from raw incident data, end to end.
Seeded synthetic records with realistic JHB ward/suburb patterns and operational noise
Null imputation, duplicate removal, date parsing, and field standardisation
Normalisation, schema alignment, relational key mapping across tables
Time-to-repair, SLA flags, repeat-incident scores, hotspot density metrics
KPI reporting, geospatial clustering, contractor benchmarking, backlog trends
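The feature-engineering stage listed above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's build_features.py: the SLA windows (3 days for potholes, 2 days for water leaks), the field names, and the sample rows are all assumptions made up for the example.

```python
from datetime import date
from typing import Optional

# Hypothetical SLA windows in days; the real values are not stated in this write-up.
SLA_DAYS = {"pothole": 3, "water_leak": 2}


def time_to_repair(reported: date, resolved: Optional[date]) -> Optional[int]:
    """Days between report and resolution, or None while the incident is still open."""
    if resolved is None:
        return None
    return (resolved - reported).days


def breached_sla(category: str, reported: date, resolved: Optional[date], today: date) -> bool:
    """True if the incident was, or already has been, open longer than its SLA window."""
    end = resolved if resolved is not None else today
    return (end - reported).days > SLA_DAYS[category]


# Made-up sample rows for illustration only.
incidents = [
    {"id": 1, "category": "pothole", "reported": date(2024, 1, 2), "resolved": date(2024, 1, 4)},
    {"id": 2, "category": "water_leak", "reported": date(2024, 1, 2), "resolved": date(2024, 1, 9)},
    {"id": 3, "category": "pothole", "reported": date(2024, 1, 5), "resolved": None},
]
today = date(2024, 1, 10)

for row in incidents:
    row["ttr_days"] = time_to_repair(row["reported"], row["resolved"])
    row["sla_breach"] = breached_sla(row["category"], row["reported"], row["resolved"], today)
```

Counting still-open incidents as breaches once they pass the window is a design choice in this sketch; it keeps an ageing backlog from hiding behind a missing resolution date.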
Before jumping to conclusions, the data gets interrogated. Distributions, trends, and anomalies — surfaced before any modelling begins.
Key finding: Water leaks are under-resourced relative to their breach rate. Volume alone doesn't tell the urgency story.
Key finding: The December–February surge is predictable yet consistently under-resourced. Pre-emptive crew allocation could reduce breach rates by ~18%.
Key finding: 15% of incidents take longer than 30 days to resolve. These long-tail cases pull the mean time-to-repair upward, and they are the ones most likely to become repeat incidents.
Key finding: Contractors C and D account for 28% of incident assignments but 47% of all SLA breaches. Reallocation would have immediate measurable impact.
Key finding: Geographic concentration of complaints maps closely to infrastructure age — not just population density. Older suburbs have structurally higher incident rates independent of headcount.
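The long-tail finding above (incidents open for more than 30 days pulling the mean upward) is easy to check by comparing the mean with the median. The sketch below uses a made-up list of repair times, chosen only so that 3 of 20 values exceed 30 days; none of the numbers come from the project's dataset.

```python
from statistics import mean, median

# Hypothetical time-to-repair values in days for 20 resolved incidents.
ttr_days = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 35, 48, 90]

LONG_TAIL_DAYS = 30  # threshold taken from the finding above

long_tail = [d for d in ttr_days if d > LONG_TAIL_DAYS]

summary = {
    "share_over_30_days": len(long_tail) / len(ttr_days),
    "mean_days": mean(ttr_days),
    "median_days": median(ttr_days),
    # Mean recomputed without the long tail, to show how much those cases move it.
    "mean_without_tail": mean(d for d in ttr_days if d <= LONG_TAIL_DAYS),
}
```

When the mean sits far above the median, as it does in this toy data, reporting both (or a percentile such as the 90th) gives a fairer picture than the mean alone.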
Analysis complete. What does this mean for the city, and what should decision-makers do about it?
22% of street segments log a second incident within 90 days of an earlier one. These repeats are not random: they cluster in 6 wards and correlate with pipe age and road surface classification.
Reactive repair is costing more than proactive replacement would. A targeted refurbishment programme in the top repeat-incident zones would reduce total incidents by an estimated 18% within 12 months.
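One way to flag repeat-incident segments is sketched below. It assumes each incident has already been deduplicated and mapped to a street-segment identifier; the segment ids, the dates, and the function name `repeat_segments` are hypothetical, and only the 90-day window comes from the finding above.

```python
from collections import defaultdict
from datetime import date

REPEAT_WINDOW_DAYS = 90  # window taken from the finding above


def repeat_segments(incidents, window_days=REPEAT_WINDOW_DAYS):
    """Return the segment ids that have two reports at most window_days apart."""
    by_segment = defaultdict(list)
    for segment_id, reported in incidents:
        by_segment[segment_id].append(reported)

    repeats = set()
    for segment_id, dates in by_segment.items():
        dates.sort()
        # Checking consecutive reports is enough: if any two reports fall
        # inside the window, some consecutive pair does as well.
        for earlier, later in zip(dates, dates[1:]):
            if (later - earlier).days <= window_days:
                repeats.add(segment_id)
                break
    return repeats


# Made-up (segment_id, report_date) pairs for illustration only.
incidents = [
    ("seg-001", date(2024, 1, 5)),
    ("seg-001", date(2024, 3, 1)),   # 56 days after the first report
    ("seg-002", date(2024, 1, 10)),
    ("seg-002", date(2024, 8, 20)),  # 223 days later, outside the window
    ("seg-003", date(2024, 2, 2)),
]

hot = repeat_segments(incidents)
repeat_share = len(hot) / len({segment for segment, _ in incidents})
```

The share is computed over distinct segments rather than over incidents, which matches how the finding is phrased.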
Remove contractors C and D from the dataset and the overall SLA breach rate drops from 34.7% to 21.3%. They're not the only problem — but they're the most fixable one.
Performance-based contracting with breach penalties would create immediate accountability. This is a procurement decision, not a data problem — but it needed the data to become visible.
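The contractor comparison can be reproduced with two small calculations: each contractor's share of assignments against its share of breaches, and the overall breach rate recomputed with selected contractors left out. The sketch below runs on ten made-up assignment records; the contractor labels reuse the letters from the text, but the counts are hypothetical and are not meant to reproduce the reported percentages.

```python
from collections import Counter

# Hypothetical (contractor, breached_sla) pairs, one per assigned incident.
assignments = [
    ("A", False), ("A", False), ("A", True),
    ("B", False), ("B", False), ("B", False),
    ("C", True), ("C", True),
    ("D", True), ("D", False),
]


def breach_rate(rows, exclude=()):
    """Share of incidents that breached SLA, ignoring any excluded contractors."""
    kept = [breached for contractor, breached in rows if contractor not in exclude]
    return sum(kept) / len(kept)


def shares(rows):
    """Per contractor: (share of all assignments, share of all breaches)."""
    assigned = Counter(contractor for contractor, _ in rows)
    breaches = Counter(contractor for contractor, breached in rows if breached)
    total_breaches = sum(breaches.values())
    return {
        contractor: (assigned[contractor] / len(rows), breaches[contractor] / total_breaches)
        for contractor in sorted(assigned)
    }


overall = breach_rate(assignments)                          # 4 breaches in 10 incidents
without_c_d = breach_rate(assignments, exclude={"C", "D"})  # 1 breach in 6 incidents
by_contractor = shares(assignments)
```

A contractor whose breach share is well above its assignment share is over-represented among breaches. The leave-out rate is a descriptive comparison only: it does not account for some contractors being assigned harder wards or incident types.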
January–March consistently sees a 2.4× spike in water leak incidents and a 1.7× spike in pothole reports. The pattern is identical across 2021, 2022, 2023, and 2024.
Pre-emptive crew surge planning in November could absorb the January spike without SLA breaches. This is a scheduling problem solvable with a 3-month forecast model.
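A seasonal multiplier like the ones quoted above can be computed as the average monthly volume in the peak months divided by the average in the remaining months. The write-up does not say which baseline the reported spikes are measured against, so that definition is an assumption here, as are the function name `seasonal_multiplier` and the toy dates.

```python
from collections import Counter
from datetime import date

PEAK_MONTHS = {1, 2, 3}  # January to March, per the finding above


def seasonal_multiplier(report_dates, peak_months=PEAK_MONTHS):
    """Average monthly volume in peak months divided by the average in other months."""
    per_month = Counter((d.year, d.month) for d in report_dates)
    peak = [n for (_, month), n in per_month.items() if month in peak_months]
    base = [n for (_, month), n in per_month.items() if month not in peak_months]
    return (sum(peak) / len(peak)) / (sum(base) / len(base))


# Toy input: 10 reports in each peak month and 5 in every other month of 2024.
report_dates = []
for month in range(1, 13):
    count = 10 if month in PEAK_MONTHS else 5
    report_dates += [date(2024, month, day) for day in range(1, count + 1)]

multiplier = seasonal_multiplier(report_dates)
```

Months with no incidents at all never enter the counter, so on sparse data the baseline would be overstated; a real implementation would need to fill in zero-count months first.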
Soweto (Ward 14), Alexandra (Ward 82), and Diepsloot (Ward 95) together account for 23% of total incidents despite representing 16% of the service area. Backlog in these zones also ages the longest.
A dedicated rapid-response unit for these three suburbs — even a small one — would produce outsized impact on citywide SLA compliance rates and resident satisfaction scores.
Skills don't live in a list. Here's what was actually used — and why.
Every stage — generation, cleaning, feature engineering, analysis — runs in Python. The pipeline is modular: generator.py, cleaners.py, and build_features.py are independently testable. Reproducibility was a design constraint from the start.
The schema was designed before a single row was generated. Separate tables for incidents, contractors, wards, and status lifecycle — joined cleanly via foreign keys. This makes the data model directly portable to a SQL warehouse or Power BI semantic layer.
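A relational layout along these lines can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions for illustration; the write-up names the four tables but not their columns.

```python
import sqlite3

# Hypothetical schema: four tables linked by foreign keys.
SCHEMA = """
CREATE TABLE wards (
    ward_id   INTEGER PRIMARY KEY,
    ward_name TEXT NOT NULL
);
CREATE TABLE contractors (
    contractor_id   INTEGER PRIMARY KEY,
    contractor_name TEXT NOT NULL
);
CREATE TABLE incidents (
    incident_id   INTEGER PRIMARY KEY,
    ward_id       INTEGER NOT NULL REFERENCES wards(ward_id),
    contractor_id INTEGER REFERENCES contractors(contractor_id),
    category      TEXT NOT NULL CHECK (category IN ('pothole', 'water_leak')),
    reported_on   TEXT NOT NULL
);
CREATE TABLE status_events (
    event_id    INTEGER PRIMARY KEY,
    incident_id INTEGER NOT NULL REFERENCES incidents(incident_id),
    status      TEXT NOT NULL,
    changed_on  TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

conn.execute("INSERT INTO wards VALUES (14, 'Ward 14')")
conn.execute("INSERT INTO contractors VALUES (1, 'Contractor A')")
conn.execute("INSERT INTO incidents VALUES (1, 14, 1, 'pothole', '2024-01-05')")

# An incident pointing at a ward that does not exist should be rejected.
try:
    conn.execute("INSERT INTO incidents VALUES (2, 999, 1, 'pothole', '2024-01-06')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Keeping status changes in their own table, one row per transition, is what makes re-opened incidents visible: a second 'Open' event for the same incident shows up as data instead of overwriting the earlier state.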
The synthetic dataset was deliberately made messy: a 4% null rate, 2% duplicate records, inconsistent suburb names ('Alex' vs 'Alexandra'), and malformed dates. The cleaning pipeline handles each class of error with a documented, reproducible strategy rather than ad-hoc patches.
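The error classes above can each be handled by a small, separately testable function. The following is a minimal sketch and not the project's cleaners.py: the alias map contains only the 'Alex' case mentioned in the text, while the accepted date formats, the field names, and the rule that rows sharing suburb, street, and report date are duplicates are all assumptions.

```python
from datetime import date, datetime
from string import capwords
from typing import Optional

# Hypothetical alias map; only the 'Alex' -> 'Alexandra' case comes from the write-up.
SUBURB_ALIASES = {"Alex": "Alexandra"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")  # assumed input formats


def clean_suburb(raw: Optional[str]) -> Optional[str]:
    """Normalise whitespace and casing, then map known aliases to one spelling."""
    if raw is None or not raw.strip():
        return None
    name = capwords(raw)
    return SUBURB_ALIASES.get(name, name)


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each accepted format in turn; unparseable values become None."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None  # kept as missing rather than guessed


def clean(rows):
    """Standardise fields and drop rows that repeat an earlier cleaned record."""
    seen, cleaned = set(), []
    for row in rows:
        record = {
            "suburb": clean_suburb(row.get("suburb")),
            "street": capwords(row.get("street") or "") or None,
            "reported": parse_date(row.get("reported")),
        }
        key = (record["suburb"], record["street"], record["reported"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Made-up raw rows for illustration only.
raw = [
    {"suburb": "Alex", "street": "1st  avenue", "reported": "2024-01-05"},
    {"suburb": "alexandra ", "street": "1st Avenue", "reported": "05/01/2024"},  # duplicate
    {"suburb": "Diepsloot", "street": "Main rd", "reported": "2024-13-40"},      # malformed date
    {"suburb": None, "street": "Main Rd", "reported": "2024/02/01"},
]
cleaned = clean(raw)
```

Deduplicating after standardisation matters: the first two rows only become recognisable as the same record once the suburb alias and the date format have been normalised. This sketch leaves missing values as None; any imputation would be a separate, documented step.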
The project starts with a business problem, not a dataset. What does the City of Johannesburg need in order to make better resource allocation decisions? The KPIs, schema design, and feature engineering all flow from that framing. Analytical thinking before analytical tools.
A reproducible, modular Python stack — designed to scale into a full BI and ML layer in Phase 2.