What is Country Classification Commons?
Country Classification Commons is an open, automated reference library that consolidates country-level classifications from the United Nations, World Bank, and OECD into a single, machine-readable dataset. It is designed for UN agencies, INGOs, academic researchers, students, data journalists, and anyone who regularly works with countries as units of analysis and needs consistent, authoritative classification metadata.
Instead of hunting across multiple agency websites, manually copy-pasting tables, or maintaining your own lookup files, you can reference a single stable URL that is refreshed automatically from the original sources.
Data sources
All data is fetched directly from authoritative public sources at every pipeline run. No manual editing is applied.
| Source ID | Organisation | Dataset | What it provides | Access method |
|---|---|---|---|---|
| `un_m49` | UN Statistics Division (UNSD) | M49 Overview | ISO2, ISO3, M49 codes; UN geoscheme (region/sub-region/intermediate); LDC, LLDC, SIDS flags; country names in 6 UN languages | HTML scrape (multilingual tabs) |
| `un_sdg` | UN SDG Global Database | SDG GeoArea API | SDG geographic area codes; SDG regional hierarchy memberships | JSON REST API |
| `world_bank` | World Bank Open Data | World Bank Country API v2 | WB income level (Low / Lower Middle / Upper Middle / High), lending type (IDA/IBRD/Blend), WB region, capital city, latitude/longitude | JSON REST API (paginated) |
| `world_bank_fcs` | World Bank | FCS Classification page | Fragile & Conflict-affected Situations (FCS) list: country name, category (Conflict vs Institutional & Social Fragility), fiscal year | Auto-discovered latest FY PDF → pdftotext parse |
| `oecd_dac` | OECD Development Assistance Committee | DAC List of ODA Recipients | ODA recipient eligibility, DAC group (LDCs / LMICs / UMICs / Other LICs), World Bank income hint, reporting year | Auto-discovered latest CSV from official OECD webfs directory |
Data pipeline methodology
Authority ranking
When multiple sources provide overlapping information (e.g. country name, ISO codes), the pipeline follows this authority order:
1. UN M49 — canonical ISO2, ISO3, M49 codes; UN geoscheme hierarchy; LDC/LLDC/SIDS flags; all six UN-language names
2. UN SDG API — SDG geographic groupings and regional hierarchy
3. World Bank API — economic classification (income level, lending type), capital cities, coordinates
4. World Bank FCS page — latest fragility roster and category
5. OECD DAC CSV — ODA recipient status and groupings
Processing steps
1. Fetch all sources in parallel using `requests.Session` with a descriptive User-Agent.
2. Parse the UN M49 HTML page for all six language tabs; derive the canonical country list from the English tab (filter to 3-character ISO3 only). Merge multilingual names on `[m49, iso3]`.
3. Merge the World Bank API response into the M49 base table on `iso3` (left join — M49 is the authority).
4. Merge the UN SDG GeoArea list on `m49`; walk the tree JSON to derive region memberships.
5. For the World Bank FCS PDF: auto-detect the latest FY PDF link, extract text with `pdftotext`, and parse country names and categories from the structured section headers.
6. For the OECD DAC CSV: auto-detect the most recent CSV in the official webfs directory; read and normalise.
7. Map FCS and OECD country names to ISO3 using Unicode-normalised string matching plus a curated alias table. Unmapped names are recorded in `unmapped_external_names.csv` for transparency.
8. Build the long-format memberships table from all sources. Emit one row per country–group pair with explicit `source` and `group_type`.
9. Denormalise into `country_classification_library` by joining memberships back to the master table.
10. Write CSV and JSON outputs; compute SHA-256 checksums; generate the run manifest.
11. Compare with the previous run and write a human-readable changelog report.
12. Copy output files into `docs/data/` for GitHub Pages serving.
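The merge sequence above (M49 base table, then World Bank, then SDG) can be sketched with pandas left joins. The column names and values here are illustrative assumptions, not the pipeline's actual schema:

```python
import pandas as pd

# Miniature stand-ins for the three merged tables (contents are illustrative).
m49 = pd.DataFrame({"m49": [792], "iso3": ["TUR"], "name_en": ["Türkiye"]})
wb = pd.DataFrame({"iso3": ["TUR"], "income_level": ["Upper middle income"]})
sdg = pd.DataFrame({"m49": [792], "sdg_region": ["Northern Africa and Western Asia"]})

# M49 is the authority: left joins keep its country list intact, so a country
# missing from the World Bank or SDG tables stays in the output with NaN fields.
base = m49.merge(wb, on="iso3", how="left").merge(sdg, on="m49", how="left")
```

The key property of the left joins is that the row count of `base` always equals the M49 country list, regardless of gaps in the other sources.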
Name-matching for FCS and OECD
The WB FCS PDF and OECD DAC CSV do not publish ISO3 codes directly, so country names must be matched to ISO3.
The pipeline applies deterministic Unicode normalisation (NFKD, accent-strip, punctuation-strip, lowercase) and then
looks up the normalised name in an index built from the M49 and World Bank name columns.
A curated alias table handles known divergences (e.g. "Türkiye" → TUR, "West Bank and Gaza Strip" → PSE).
Any name that cannot be resolved is saved to `unmapped_external_names.csv` so coverage gaps are transparent.
Automation — GitHub Actions workflows
1. Update Data (update-data.yml)
This workflow runs the full data pipeline and commits any changes back to the repository.
| Trigger | Schedule (UTC) | Purpose |
|---|---|---|
| Weekday morning | 05:17 Mon–Fri | Catch overnight upstream updates (WB API, M49 edits) |
| Weekday evening | 17:17 Mon–Fri | Catch same-day source revisions |
| Weekly safety refresh | 09:42 every Sunday | Ensure data is never more than 7 days old |
| Manual dispatch | on demand | Force a refresh at any time from GitHub Actions UI |
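In GitHub Actions syntax, the schedule above corresponds roughly to the following triggers (a sketch; the actual `update-data.yml` may differ in detail):

```yaml
on:
  schedule:
    - cron: "17 5 * * 1-5"   # weekday morning, 05:17 UTC Mon–Fri
    - cron: "17 17 * * 1-5"  # weekday evening, 17:17 UTC Mon–Fri
    - cron: "42 9 * * 0"     # weekly safety refresh, Sunday 09:42 UTC
  workflow_dispatch:         # manual "run now" from the Actions UI
```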
The workflow checks out the repository, installs Python 3.11 and `requirements.txt`,
runs `scripts/update_data.py`, stages the changed files under `data/latest/`,
`data/changelog/`, `data/history/`, and `docs/data/`,
and commits and pushes only if there is a diff.
The commit message is always `chore(data): automated refresh`, keeping the history clean and auditable.
2. Deploy Pages (deploy-pages.yml)
Every push to `main` that touches any file under `docs/` or the workflow file itself
triggers this workflow. It uploads the `docs/` folder as a GitHub Pages artifact and deploys it
to https://mafiAtUN.github.io/country-classification-commons/.
Concurrent runs are cancelled so only the latest version is ever deployed.
This means: when the data pipeline runs and finds changes, it commits → that commit triggers Pages deploy → within minutes the live website serves the updated data files.
Data freshness
Under normal conditions the data is refreshed up to 11 times per week (twice per weekday plus once on Sunday).
The exact "last refreshed" timestamp is embedded in every output file via `run_manifest.json` and visible in the Explorer header.
Each run also archives a full snapshot to `data/history/<snapshot-id>/`, so historical versions are always available in the repository.
Change tracking & reproducibility
Per-run changelog
Every run generates a Markdown report at `data/changelog/changes_<snapshot_id>.md` listing:
- Countries/areas added or removed since the previous run
- Countries where core metadata changed (income level, FCS status, ODA eligibility, etc.)
- Group memberships added or removed
- Any country names from external sources (FCS, OECD) that could not be mapped to ISO3
SHA-256 checksums
The `run_manifest.json` file includes a SHA-256 hash for every output file.
This lets you verify that the file you downloaded matches exactly what the pipeline produced,
and detect any transmission or storage corruption.
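Verification can be done with the standard library alone. The manifest key layout used here (a `files` mapping from file name to hex digest) is an assumption about `run_manifest.json`'s schema, not a documented guarantee:

```python
import hashlib
import json

def sha256_of(path: str) -> str:
    """Compute the SHA-256 hex digest of a file, streaming in 64 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, manifest_path: str, name: str) -> bool:
    """Check a downloaded file against the hash recorded in the manifest.

    Assumes manifest["files"][name] holds the expected hex digest (illustrative).
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    return sha256_of(path) == manifest["files"][name]
```

Streaming in chunks keeps memory flat even for large CSV outputs.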
Full history in git
Every automated commit is versioned in the repository's git history.
You can check out any past commit to reproduce the exact dataset from that point in time.
Snapshots are also copied to `data/history/` for convenient access without git.
Limitations & caveats
- Source lag: The pipeline reflects the state of upstream sources at the time of the last run. If a source organisation updates their classification mid-day, the change will appear in the next pipeline run (at most ~12 hours on weekdays).
- FCS PDF parsing: The World Bank FCS list is published only as a PDF. The pipeline uses `pdftotext` to extract text and parses the output using section headers. If the PDF layout changes, parsing may fail or produce incorrect results — unmapped names would appear in `unmapped_external_names.csv`.
- OECD DAC CSV format: The pipeline auto-selects the most recent CSV in the OECD webfs directory. Column names are assumed stable; a format change could require a script update.
- Coverage: Only countries/areas that appear in the UN M49 overview with a valid 3-character ISO3 code are included. Sub-national and provisional territories without ISO3 codes are excluded.
- Boolean columns in CSV: capitalised `True`/`False` strings are used. In JSON, proper JSON booleans are used.
- Non-member UN territories: The dataset covers UN statistical countries/areas, which includes some non-member territories (e.g. Taiwan, Kosovo, Western Sahara) under their ISO3 codes.
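Because the CSV stores booleans as the strings `True`/`False`, consumers should convert them explicitly rather than relying on truthiness (any non-empty string is truthy in Python). A minimal sketch; the `is_ldc` column name and sample rows are illustrative assumptions:

```python
import csv
import io

def to_bool(value: str) -> bool:
    """Convert the dataset's capitalised boolean strings to Python booleans."""
    return value == "True"

# Stand-in for the downloaded CSV (two illustrative rows).
sample = io.StringIO("iso3,is_ldc\nAFG,True\nTUR,False\n")
rows = [{**row, "is_ldc": to_bool(row["is_ldc"])} for row in csv.DictReader(sample)]
rows[0]["is_ldc"]  # → True (a real bool, not the string "True")
```

Note that `bool("False")` would evaluate to `True`, which is exactly the trap the explicit comparison avoids. The JSON outputs need no such conversion.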
Citation & license
This project is open source under the MIT License. The compiled dataset is derived from public data published by UNSD, World Bank, and OECD. Please also cite the original sources when using the data in publications:
- United Nations Statistics Division. M49 Standard country or area codes for statistical use. unstats.un.org
- World Bank. World Bank Country and Lending Groups. World Bank Open Data
- World Bank. Harmonized List of Fragile Situations. worldbank.org
- OECD. DAC List of ODA Recipients. oecd.org
Repository: github.com/MafiAtUN/country-classification-commons