The problem
Public records are public in name only. They exist in thousands of incompatible formats — PDFs scanned from paper, FTP servers from the early 2000s, Excel sheets emailed between departments, proprietary database exports that require software no one has a license for anymore.
Researchers waste months acquiring data that should take hours. Journalists miss stories because the signal is buried in noise. Civic technologists build on unstable foundations because the underlying data is never guaranteed to be consistent.
GovData is the infrastructure layer that was missing.
The hardest design problem wasn't the interface. It was reconciling records that came from unrelated sources but described the same thing.
Approach
The system is built around three phases: acquisition, normalisation, and publication.
Acquisition: automated scrapers and agency partnerships bring in raw records across 40+ source formats. Every record is checksummed at ingestion so provenance is always traceable.
Normalisation: an ML pipeline classifies and deduplicates the incoming data and maps it to a schema. Edge cases are surfaced to a small human review queue rather than silently dropped.
Publication: a versioned, queryable API with a schema-first design. Every dataset has a changelog. Consumers can pin to a version and are notified of breaking changes.
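Two of the mechanics above, checksumming at ingestion and flagging breaking changes between dataset versions, can be illustrated with a minimal sketch. This is not GovData's actual code: the record fields, the function names `ingest` and `breaking_changes`, the choice of SHA-256, and the rule that removing or retyping a field counts as breaking are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestedRecord:
    """Hypothetical provenance envelope stored alongside each raw record."""
    source_url: str   # where the raw bytes were fetched from
    sha256: str       # checksum of the raw bytes, computed before parsing
    fetched_at: str   # ISO 8601 timestamp in UTC
    size_bytes: int


def ingest(raw: bytes, source_url: str) -> IngestedRecord:
    """Checksum the raw payload before any normalisation touches it."""
    return IngestedRecord(
        source_url=source_url,
        sha256=hashlib.sha256(raw).hexdigest(),
        fetched_at=datetime.now(timezone.utc).isoformat(),
        size_bytes=len(raw),
    )


def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two schema versions (field name -> type name).

    Assumption: removing a field or changing its type is breaking;
    adding a new field is not.
    """
    changes = []
    for name, old_type in old.items():
        if name not in new:
            changes.append(f"removed field: {name}")
        elif new[name] != old_type:
            changes.append(f"type changed: {name} ({old_type} -> {new[name]})")
    return changes
```

Because the checksum is taken over the untouched bytes, re-fetching an identical file yields the same digest, which is what makes a later record traceable to its source. A consumer pinned to an older version would only need a notification when `breaking_changes` returns a non-empty list.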
The designer's job here was primarily information architecture: how do you make a system this complex comprehensible to a researcher who just wants to download a CSV?
The interface
GovData defaults to light mode: it's a professional tool used in bright office environments, not a late-night dashboard. The blue accent meets the WCAG AA contrast ratio against the light surface.
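For reference, an AA check like this can be reproduced with the WCAG 2.x relative-luminance formula. The sketch below assumes normal-size text (AA threshold 4.5:1, or 3:1 for large text). The hex values in the usage lines are placeholders, not GovData's actual palette.

```python
def _linearise(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour given as '#rrggbb'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearise(r) + 0.7152 * _linearise(g) + 0.0722 * _linearise(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """AA requires at least 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


# Placeholder colours for illustration only (not the real accent or surface).
accent, surface = "#1d4ed8", "#ffffff"
print(f"{contrast_ratio(accent, surface):.2f}:1, AA pass: {meets_aa(accent, surface)}")
```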
The dataset explorer is the centrepiece: a split view with schema on the left and live preview on the right. Every field is described, typed, and linked to its provenance. The search layer understands natural language queries and maps them to structured filters.
We designed for the three core personas — the researcher who knows exactly what they want, the journalist exploring an unfamiliar dataset, and the engineer who needs to understand the API contract — without making any of them feel like they're using the wrong product.
Outcomes
GovData is now the data layer behind three investigative journalism projects and is in active use by two university research groups. The structured changelog has been cited as the feature that distinguishes it from raw FOIA dumps — users know exactly what changed and when.