About & Methodology

How data is collected, processed, validated, and kept up-to-date — and why you can trust it.

What is Country Classification Commons?

Country Classification Commons is an open, automated reference library that consolidates country-level classifications from the United Nations, World Bank, and OECD into a single, machine-readable dataset. It is designed for UN agencies, INGOs, academic researchers, students, data journalists, and anyone who regularly works with countries as units of analysis and needs consistent, authoritative classification metadata.

Instead of hunting across multiple agency websites, manually copy-pasting tables, or maintaining your own lookup files, you can reference a single stable URL that is refreshed automatically from the original sources.

Data sources

All data is fetched directly from authoritative public sources at every pipeline run. No manual editing is applied.

| Source ID | Organisation | Dataset | What it provides | Access method |
|---|---|---|---|---|
| un_m49 | UN Statistics Division (UNSD) | M49 Overview | ISO2, ISO3, M49 codes; UN geoscheme (region/sub-region/intermediate); LDC, LLDC, SIDS flags; country names in 6 UN languages | HTML scrape (multilingual tabs) |
| un_sdg | UN SDG Global Database | SDG GeoArea API | SDG geographic area codes; SDG regional hierarchy memberships | JSON REST API |
| world_bank | World Bank Open Data | World Bank Country API v2 | WB income level (Low / Lower Middle / Upper Middle / High), lending type (IDA/IBRD/Blend), WB region, capital city, latitude/longitude | JSON REST API (paginated) |
| world_bank_fcs | World Bank | FCS Classification page | Fragile & Conflict-affected Situations (FCS) list: country name, category (Conflict vs Institutional & Social Fragility), fiscal year | Auto-discovered latest FY PDF → pdftotext parse |
| oecd_dac | OECD Development Assistance Committee | DAC List of ODA Recipients | ODA recipient eligibility, DAC group (LDCs / LMICs / UMICs / Other LICs), World Bank income hint, reporting year | Auto-discovered latest CSV from official OECD webfs directory |
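
To illustrate the shape of the World Bank Country API v2 response, here is a minimal parser sketch. The field names below match the public v2 JSON response; the helper function, and the sample record values, are illustrative rather than the pipeline's actual code.

```python
def parse_wb_country(record: dict) -> dict:
    """Extract the classification fields used by the pipeline from one
    World Bank v2 country record (hypothetical helper, not the real code)."""
    return {
        "iso3": record["id"],
        "iso2": record["iso2Code"],
        "name": record["name"],
        "wb_region": record["region"]["value"],
        "income_level": record["incomeLevel"]["value"],
        "lending_type": record["lendingType"]["value"],
        "capital_city": record["capitalCity"] or None,
    }

# A real run would page through
# https://api.worldbank.org/v2/country?format=json&per_page=400
# and feed each record to the parser. Sample record for illustration:
sample = {
    "id": "KEN", "iso2Code": "KE", "name": "Kenya",
    "region": {"id": "SSF", "value": "Sub-Saharan Africa"},
    "incomeLevel": {"id": "LMC", "value": "Lower middle income"},
    "lendingType": {"id": "IDB", "value": "Blend"},
    "capitalCity": "Nairobi",
}
row = parse_wb_country(sample)
```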

Data pipeline methodology

Authority ranking

When multiple sources provide overlapping information (e.g. country name, ISO codes), the pipeline follows this authority order:

  1. UN M49 — canonical ISO2, ISO3, M49 codes; UN geoscheme hierarchy; LDC/LLDC/SIDS flags; all six UN-language names
  2. UN SDG API — SDG geographic groupings and regional hierarchy
  3. World Bank API — economic classification (income level, lending type), capital cities, coordinates
  4. World Bank FCS page — latest fragility roster and category
  5. OECD DAC CSV — ODA recipient status and groupings
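
The ranking above amounts to a coalesce over per-source records, taking the first non-empty value in authority order. A minimal sketch (the function and field names are illustrative, not the pipeline's actual code):

```python
AUTHORITY_ORDER = ["un_m49", "un_sdg", "world_bank", "world_bank_fcs", "oecd_dac"]

def coalesce_field(records_by_source: dict, field: str):
    """Return the value from the highest-authority source that supplies it."""
    for source in AUTHORITY_ORDER:
        value = records_by_source.get(source, {}).get(field)
        if value not in (None, ""):
            return value
    return None

records = {
    "world_bank": {"name": "Turkiye", "capital": "Ankara"},
    "un_m49": {"name": "Türkiye"},
}
coalesce_field(records, "name")     # → "Türkiye" (UN M49 outranks World Bank)
coalesce_field(records, "capital")  # → "Ankara" (falls through to World Bank)
```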

Processing steps

  1. Fetch all sources in parallel using requests.Session with a descriptive User-Agent.
  2. Parse UN M49 HTML page for all six language tabs; derive the canonical country list from the English tab (filter to 3-character ISO3 only). Merge multilingual names on [m49, iso3].
  3. Merge World Bank API response into the M49 base table on iso3 (left join — M49 is the authority).
  4. Merge UN SDG GeoArea list on m49; walk the tree JSON to derive region memberships.
  5. For the World Bank FCS PDF: auto-detect the latest FY PDF link, extract text with pdftotext, parse country names and categories from the structured section headers.
  6. For OECD DAC CSV: auto-detect the most recent CSV in the official webfs directory; read and normalise.
  7. Map FCS and OECD country names to ISO3 using Unicode-normalised string matching + a curated alias table. Unmapped names are recorded in unmapped_external_names.csv for transparency.
  8. Build the long-format memberships table from all sources. Emit one row per country–group pair with explicit source and group_type.
  9. Denormalise into country_classification_library by joining memberships back to the master table.
  10. Write CSV and JSON outputs; compute SHA-256 checksums; generate the run manifest.
  11. Compare with the previous run and write a human-readable changelog report.
  12. Copy output files into docs/data/ for GitHub Pages serving.
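
Step 3, the left join that keeps M49 as the authority, can be sketched in plain Python (dict-based for clarity; the real pipeline may use a dataframe library):

```python
def left_join_on_iso3(m49_rows: list, wb_rows: list) -> list:
    """Left-join World Bank fields onto the M49 base table on iso3.
    Every M49 row survives; World-Bank-only rows are dropped."""
    wb_index = {r["iso3"]: r for r in wb_rows}
    joined = []
    for base in m49_rows:
        extra = wb_index.get(base["iso3"], {})
        merged = {**extra, **base}  # M49 values win on any conflicting key
        joined.append(merged)
    return joined

m49 = [
    {"iso3": "FRA", "name": "France"},
    {"iso3": "PRK", "name": "Democratic People's Republic of Korea"},
]
wb = [{"iso3": "FRA", "income_level": "High income"}]
rows = left_join_on_iso3(m49, wb)  # PRK survives without WB fields
```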

Name-matching for FCS and OECD

The WB FCS PDF and OECD DAC CSV do not publish ISO3 codes directly, so country names must be matched to ISO3. The pipeline applies deterministic Unicode normalisation (NFKD, accent-strip, punctuation-strip, lowercase) and then looks up the normalised name in an index built from the M49 and World Bank name columns. A curated alias table handles known divergences (e.g. "Türkiye" → TUR, "West Bank and Gaza Strip" → PSE). Any name that cannot be resolved is saved to unmapped_external_names.csv so coverage gaps are transparent.
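
A minimal sketch of this normalisation (NFKD decomposition, accent- and punctuation-stripping, lowercasing) plus the alias lookup; the alias entries shown and the function names are illustrative:

```python
import string
import unicodedata

# Curated alias examples; the real table is larger.
ALIASES = {"turkiye": "TUR", "west bank and gaza strip": "PSE"}

def normalise(name: str) -> str:
    """NFKD-decompose, drop combining marks and punctuation, lowercase,
    and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = stripped.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def resolve_iso3(name: str, name_index: dict):
    """Look up the normalised name, falling back to the alias table."""
    key = normalise(name)
    return name_index.get(key) or ALIASES.get(key)

index = {"france": "FRA"}  # built from the M49 + World Bank name columns in practice
resolve_iso3("Türkiye", index)  # → "TUR" via the alias table
```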

Automation — GitHub Actions workflows

1. Update Data (update-data.yml)

This workflow runs the full data pipeline and commits any changes back to the repository.

| Trigger | Schedule (UTC) | Purpose |
|---|---|---|
| Weekday morning | 05:17 Mon–Fri | Catch overnight upstream updates (WB API, M49 edits) |
| Weekday evening | 17:17 Mon–Fri | Catch same-day source revisions |
| Weekly safety refresh | 09:42 every Sunday | Ensure data is never more than 7 days old |
| Manual dispatch | on demand | Force a refresh at any time from the GitHub Actions UI |

The workflow checks out the repository, installs Python 3.11 and the dependencies in requirements.txt, runs scripts/update_data.py, stages the changed files under data/latest/, data/changelog/, data/history/, and docs/data/, and commits and pushes only if there is a diff. The commit message is always chore(data): automated refresh, keeping the history clean and auditable.
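
The commit-only-if-there-is-a-diff step boils down to inspecting git status --porcelain output. A sketch of that decision in Python (the helper names are hypothetical; the actual workflow does this in shell):

```python
import subprocess

def has_changes(porcelain: str) -> bool:
    """True if `git status --porcelain` output reports any changed file."""
    return any(line.strip() for line in porcelain.splitlines())

def refresh_commit_needed() -> bool:
    """Check the data directories for uncommitted pipeline output."""
    out = subprocess.run(
        ["git", "status", "--porcelain", "data/latest/", "docs/data/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return has_changes(out)

# If True, the workflow effectively runs:
#   git commit -m "chore(data): automated refresh" && git push
```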

2. Deploy Pages (deploy-pages.yml)

Every push to main that touches any file under docs/ or the workflow file itself triggers this workflow. It uploads the docs/ folder as a GitHub Pages artifact and deploys it to https://mafiAtUN.github.io/country-classification-commons/. Concurrent runs are cancelled so only the latest version is ever deployed.

This means: when the data pipeline runs and finds changes, it commits → that commit triggers Pages deploy → within minutes the live website serves the updated data files.

Data freshness

Under normal conditions the data is refreshed up to 11 times per week (twice per weekday plus once on Sunday). The exact "last refreshed" timestamp is embedded in every output file via run_manifest.json and visible in the Explorer header. Each run also archives a full snapshot to data/history/<snapshot-id>/, so historical versions are always available in the repository.

Change tracking & reproducibility

Per-run changelog

Every run generates a Markdown report at data/changelog/changes_<snapshot_id>.md listing:

  • Countries/areas added or removed since the previous run
  • Countries where core metadata changed (income level, FCS status, ODA eligibility, etc.)
  • Group memberships added or removed
  • Any country names from external sources (FCS, OECD) that could not be mapped to ISO3
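
The changelog items above amount to a set/dict diff between the previous and current snapshots. A simplified sketch for the income-level case (the function and field names are illustrative, not the pipeline's actual code):

```python
def diff_income_levels(previous: dict, current: dict) -> dict:
    """Compare {iso3: income_level} maps between two runs and report
    additions, removals, and changed values."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = {
        iso3: (previous[iso3], current[iso3])
        for iso3 in previous.keys() & current.keys()
        if previous[iso3] != current[iso3]
    }
    return {"added": added, "removed": removed, "changed": changed}

prev = {"IND": "Lower middle income", "CHL": "High income"}
curr = {"IND": "Lower middle income", "CHL": "Upper middle income", "SSD": "Low income"}
report = diff_income_levels(prev, curr)
```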

SHA-256 checksums

The run_manifest.json file includes a SHA-256 hash for every output file. This lets you verify that the file you downloaded matches exactly what the pipeline produced, and detect any transmission or storage corruption.
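
Verifying a downloaded file against the manifest takes a few lines of standard-library Python. The manifest key layout shown here is an assumption; check run_manifest.json for the actual structure.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 to avoid loading it whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, manifest_path: Path) -> bool:
    """Compare a file's hash to its manifest entry (assumed key layout)."""
    manifest = json.loads(manifest_path.read_text())
    expected = manifest["checksums"][path.name]  # assumed layout
    return sha256_of(path) == expected
```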

Full history in git

Every automated commit is versioned in the repository's git history. You can check out any past commit to reproduce the exact dataset from that point in time. Snapshots are also copied to data/history/ for convenient access without git.

Limitations & caveats

Citation & license

This project is open source under the MIT License. The compiled dataset is derived from public data published by UNSD, World Bank, and OECD. Please also cite the original sources when using the data in publications:

Repository: github.com/MafiAtUN/country-classification-commons