Methodology
How the data on this site is sourced, transformed, refreshed, and validated.
Primary sources
| Source | What we get | Refresh |
|---|---|---|
| Companies House — BasicCompanyDataAsOneFile ↗ | Core registry (5.7M companies) | Monthly |
| Companies House — Public Data API ↗ | Officers, filings, charges, PSCs | On-demand, 7-day cache |
| UK SIC 2007 ↗ | Industry classification reference | Static (SIC 2007 revision) |
| Open Government Licence v3.0 ↗ | Licence under which we use the data | n/a |
Refresh process
- On the first Monday of each month, a scheduled job discovers the latest `BasicCompanyDataAsOneFile-YYYY-MM-DD.zip` from the Companies House download page.
- The job downloads the ZIP (~500 MB) and extracts the underlying CSV (~2.7 GB, 5.7 million rows).
- Each row is transformed: dates parsed from DD/MM/YYYY to ISO, SIC codes extracted from the “12345 - Description” format, addresses flattened, previous names rolled into JSONB.
- Rows stream into a new table `companies_new` via Postgres COPY (no inserts go through ORM).
- Indexes are built on the new table after loading completes (faster than incremental updates during load).
- A single atomic `ALTER TABLE … RENAME` swaps the new table in. The old table is preserved for 24 hours as a rollback safety.
- SIC code aggregates are recomputed.
- Total runtime: under 5 minutes end-to-end on the production Hetzner box. The user-facing site stays available throughout.
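The row transformation step can be sketched roughly as below. This is a minimal illustration, not the production code: the dictionary keys (`incorporation_date`, `sic_1`, `address_line_1` and so on) are hypothetical stand-ins for the real CSV column names, and "flattening" an address into one comma-separated string is an assumption about what that step does.

```python
import json
import re
from datetime import datetime

# Matches the "12345 - Description" SIC format described in the list above.
SIC_PATTERN = re.compile(r"^\s*(\d+)\s*-\s*(.+?)\s*$")

# Hypothetical column names, used only for this sketch.
SIC_KEYS = ("sic_1", "sic_2")
ADDRESS_KEYS = ("address_line_1", "post_town", "postcode")
PREVIOUS_NAME_KEYS = ("previous_name_1", "previous_name_2")


def parse_date(value):
    """Convert DD/MM/YYYY to ISO 8601 (YYYY-MM-DD); blank values become None."""
    value = (value or "").strip()
    if not value:
        return None
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


def parse_sic(value):
    """Return the numeric code from "12345 - Description", or None if no match."""
    match = SIC_PATTERN.match(value or "")
    return match.group(1) if match else None


def transform_row(row):
    """Map one raw CSV row (a dict of strings) to the shape loaded into Postgres."""
    sic_codes = [code for code in (parse_sic(row.get(k)) for k in SIC_KEYS) if code]
    address_parts = [(row.get(k) or "").strip() for k in ADDRESS_KEYS]
    previous = [row[k] for k in PREVIOUS_NAME_KEYS if row.get(k)]
    return {
        "incorporated_on": parse_date(row.get("incorporation_date")),
        "sic_codes": sic_codes,
        # Assumed meaning of "flattened": non-empty parts joined into one string.
        "address": ", ".join(part for part in address_parts if part),
        # Serialised as JSON text, suitable for a JSONB column.
        "previous_names": json.dumps(previous),
    }
```

Values that do not fit the SIC pattern are simply skipped in this sketch; how the real job treats them is not stated here.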
Live data via the API
When a user opens a company profile, we fetch its officers, filings, charges, and persons-with-significant-control from the Companies House Public Data API. These responses are cached locally for seven days, so repeat visits don't re-hit the API.
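A seven-day cache of this kind might look like the sketch below. It assumes an in-memory dictionary and a caller-supplied `fetch` function; both are illustrative choices, and the class name `TTLCache` is hypothetical. The actual storage used for the local cache is not described on this page.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # cache lifetime in seconds


class TTLCache:
    """Keep fetched values for a fixed time, then fetch them again on demand."""

    def __init__(self, ttl_seconds=SEVEN_DAYS, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}   # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) if missing or stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

With this shape, a profile page would call something like `cache.get_or_fetch(company_number, fetch_officers)`, where `fetch_officers` is whatever function performs the real API request.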
Rate limit: 600 requests per 5 minutes per API key. With the seven-day cache in place, real traffic stays comfortably below this limit.
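For illustration, a client-side guard for a limit of 600 requests per 5 minutes could be written as a sliding-window counter. This is a sketch under assumptions: the class name `SlidingWindowLimiter` is hypothetical, and how Companies House counts the window on its side is not specified here, so the guard is deliberately simple.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=600, window_seconds=300, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._stamps = deque()  # timestamps of requests still inside the window

    def try_acquire(self):
        """Record a request and return True, or return False if the window is full."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False
        self._stamps.append(now)
        return True
```

Averaged out, 600 requests per 5 minutes is 2 requests per second, and only cache misses count against it.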
Known limitations
- The bulk dataset only contains companies currently on the live register. Fully dissolved companies are removed from it, so dissolution counts on this site understate the true number.
- The bulk dataset is published once a month, so companies incorporated or dissolved within the last ~30 days may not yet be reflected in our search index. Their profile pages still load fresh data from the live API.
- SIC code descriptions are extracted from the bulk dataset itself, so any new SIC text variations introduced by Companies House propagate through naturally.
- The site uses Postgres trigram fuzzy matching for typos, but very short queries (1–2 characters) won't return useful results.
- Postcodes containing non-postcode strings (Companies House allows values like “NOT APPLICABLE”) are passed through as-is, so the postcode filter occasionally surfaces unusual values.
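To see why very short queries struggle under trigram matching, the sketch below approximates how `pg_trgm` scores similarity: each word is padded with two leading spaces and one trailing space, split into three-character windows, and two strings are compared by shared trigrams over total distinct trigrams. It is a simplification (the real extension also strips non-alphanumeric characters), and the function names are illustrative.

```python
def trigrams(text):
    """Return the set of trigrams for text, padding each word as pg_trgm does."""
    result = set()
    for word in text.lower().split():
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 if either is empty)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(similarity("tesoc", "tesco"))  # a five-letter typo of the target
print(similarity("te", "tesco"))     # a two-letter query against the same target
```

In this sketch the typo scores 3/9 while the two-letter query scores 2/7, which falls under `pg_trgm`'s default similarity threshold of 0.3. A one- or two-character query produces so few trigrams that it either misses the threshold or matches a very large number of names about equally well.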
Corrections
If you spot an error in our presentation of the data, email [email protected] and we'll fix or remove the issue within 48 hours.
For corrections to the underlying record — directors, status, registered office — you need to file with Companies House directly. We re-load the data each month and your changes propagate automatically.