About & Methodology

How data is collected, processed, validated, and kept up-to-date — and why you can trust it.

What is Country Classification Commons?

Country Classification Commons is an open, automated reference library that consolidates country-level classifications from the United Nations, World Bank, and OECD into a single, machine-readable dataset. It is designed for UN agencies, INGOs, academic researchers, students, data journalists, and anyone who regularly works with countries as units of analysis and needs consistent, authoritative classification metadata.

Instead of hunting across multiple agency websites, manually copy-pasting tables, or maintaining your own lookup files, you can reference a single stable URL that is refreshed automatically from the original sources.

Data sources

All data is fetched directly from authoritative public sources at every pipeline run. No manual editing is applied.

| Source ID | Organisation | Dataset | What it provides | Access method |
|---|---|---|---|---|
| un_m49 | UN Statistics Division (UNSD) | M49 Overview | ISO2, ISO3, M49 codes; UN geoscheme (region/sub-region/intermediate); LDC, LLDC, SIDS flags; country names in 6 UN languages | HTML scrape (multilingual tabs) |
| un_sdg | UN SDG Global Database | SDG GeoArea API | SDG geographic area codes; SDG regional hierarchy memberships | JSON REST API |
| world_bank | World Bank Open Data | World Bank Country API v2 | WB income level (Low / Lower Middle / Upper Middle / High), lending type (IDA/IBRD/Blend), WB region, capital city, latitude/longitude | JSON REST API (paginated) |
| world_bank_fcs | World Bank | FCS Classification page | Fragile & Conflict-affected Situations (FCS) list: country name, category (Conflict vs Institutional & Social Fragility), fiscal year | Auto-discovered latest FY PDF → pdftotext parse |
| oecd_dac | OECD Development Assistance Committee | DAC List of ODA Recipients | ODA recipient eligibility, DAC group (LDCs / LMICs / UMICs / Other LICs), World Bank income hint, reporting year | Auto-discovered latest CSV from official OECD webfs directory |
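
To illustrate the shape of the World Bank Country API v2 response, here is a minimal parser sketch. The field names below match the public v2 JSON response; the helper function, and the sample record values, are illustrative rather than the pipeline's actual code.

```python
def parse_wb_country(record: dict) -> dict:
    """Extract the classification fields used by the pipeline from one
    World Bank v2 country record (hypothetical helper, not the real code)."""
    return {
        "iso3": record["id"],
        "iso2": record["iso2Code"],
        "name": record["name"],
        "wb_region": record["region"]["value"],
        "income_level": record["incomeLevel"]["value"],
        "lending_type": record["lendingType"]["value"],
        "capital_city": record["capitalCity"] or None,
    }

# A real run would page through
# https://api.worldbank.org/v2/country?format=json&per_page=400
# and feed each record to the parser. Sample record for illustration:
sample = {
    "id": "KEN", "iso2Code": "KE", "name": "Kenya",
    "region": {"id": "SSF", "value": "Sub-Saharan Africa"},
    "incomeLevel": {"id": "LMC", "value": "Lower middle income"},
    "lendingType": {"id": "IDB", "value": "Blend"},
    "capitalCity": "Nairobi",
}
row = parse_wb_country(sample)
```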

Data pipeline methodology

Authority ranking

When multiple sources provide overlapping information (e.g. country name, ISO codes), the pipeline follows this authority order:

  1. UN M49 — canonical ISO2, ISO3, M49 codes; UN geoscheme hierarchy; LDC/LLDC/SIDS flags; all six UN-language names
  2. UN SDG API — SDG geographic groupings and regional hierarchy
  3. World Bank API — economic classification (income level, lending type), capital cities, coordinates
  4. World Bank FCS page — latest fragility roster and category
  5. OECD DAC CSV — ODA recipient status and groupings
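
The ranking above amounts to a coalesce over per-source records, taking the first non-empty value in authority order. A minimal sketch (the function and field names are illustrative, not the pipeline's actual code):

```python
AUTHORITY_ORDER = ["un_m49", "un_sdg", "world_bank", "world_bank_fcs", "oecd_dac"]

def coalesce_field(records_by_source: dict, field: str):
    """Return the value from the highest-authority source that supplies it."""
    for source in AUTHORITY_ORDER:
        value = records_by_source.get(source, {}).get(field)
        if value not in (None, ""):
            return value
    return None

records = {
    "world_bank": {"name": "Turkiye", "capital": "Ankara"},
    "un_m49": {"name": "Türkiye"},
}
coalesce_field(records, "name")     # → "Türkiye" (UN M49 outranks World Bank)
coalesce_field(records, "capital")  # → "Ankara" (falls through to World Bank)
```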

Processing steps

  1. Fetch all sources in parallel using requests.Session with a descriptive User-Agent.
  2. Parse UN M49 HTML page for all six language tabs; derive the canonical country list from the English tab (filter to 3-character ISO3 only). Merge multilingual names on [m49, iso3].
  3. Merge World Bank API response into the M49 base table on iso3 (left join — M49 is the authority).
  4. Merge UN SDG GeoArea list on m49; walk the tree JSON to derive region memberships.
  5. For the World Bank FCS PDF: auto-detect the latest FY PDF link, extract text with pdftotext, parse country names and categories from the structured section headers.
  6. For OECD DAC CSV: auto-detect the most recent CSV in the official webfs directory; read and normalise.
  7. Map FCS and OECD country names to ISO3 using Unicode-normalised string matching + a curated alias table. Unmapped names are recorded in unmapped_external_names.csv for transparency.
  8. Build the long-format memberships table from all sources. Emit one row per country–group pair with explicit source and group_type.
  9. Denormalise into country_classification_library by joining memberships back to the master table.
  10. Write CSV and JSON outputs; compute SHA-256 checksums; generate the run manifest.
  11. Compare with the previous run and write a human-readable changelog report.
  12. Copy output files into docs/data/ for GitHub Pages serving.
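
Step 3, the left join that keeps M49 as the authority, can be sketched in plain Python (dict-based for clarity; the real pipeline may use a dataframe library):

```python
def left_join_on_iso3(m49_rows: list, wb_rows: list) -> list:
    """Left-join World Bank fields onto the M49 base table on iso3.
    Every M49 row survives; World-Bank-only rows are dropped."""
    wb_index = {r["iso3"]: r for r in wb_rows}
    joined = []
    for base in m49_rows:
        extra = wb_index.get(base["iso3"], {})
        merged = {**extra, **base}  # M49 values win on any conflicting key
        joined.append(merged)
    return joined

m49 = [
    {"iso3": "FRA", "name": "France"},
    {"iso3": "PRK", "name": "Democratic People's Republic of Korea"},
]
wb = [{"iso3": "FRA", "income_level": "High income"}]
rows = left_join_on_iso3(m49, wb)  # PRK survives without WB fields
```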

Name-matching for FCS and OECD

The WB FCS PDF and OECD DAC CSV do not publish ISO3 codes directly, so country names must be matched to ISO3. The pipeline applies deterministic Unicode normalisation (NFKD, accent-strip, punctuation-strip, lowercase) and then looks up the normalised name in an index built from the M49 and World Bank name columns. A curated alias table handles known divergences (e.g. "Türkiye" → TUR, "West Bank and Gaza Strip" → PSE). Any name that cannot be resolved is saved to unmapped_external_names.csv so coverage gaps are transparent.
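
A minimal sketch of this normalisation (NFKD decomposition, accent- and punctuation-stripping, lowercasing) plus the alias lookup; the alias entries shown and the function names are illustrative:

```python
import string
import unicodedata

# Curated alias examples; the real table is larger.
ALIASES = {"turkiye": "TUR", "west bank and gaza strip": "PSE"}

def normalise(name: str) -> str:
    """NFKD-decompose, drop combining marks and punctuation, lowercase,
    and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = stripped.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def resolve_iso3(name: str, name_index: dict):
    """Look up the normalised name, falling back to the alias table."""
    key = normalise(name)
    return name_index.get(key) or ALIASES.get(key)

index = {"france": "FRA"}  # built from the M49 + World Bank name columns in practice
resolve_iso3("Türkiye", index)  # → "TUR" via the alias table
```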

Automation — GitHub Actions workflows

1. Update Data (update-data.yml)

This workflow runs the full data pipeline and commits any changes back to the repository.

| Trigger | Schedule (UTC) | Purpose |
|---|---|---|
| Weekday morning | 05:17 Mon–Fri | Catch overnight upstream updates (WB API, M49 edits) |
| Weekday evening | 17:17 Mon–Fri | Catch same-day source revisions |
| Weekly safety refresh | 09:42 every Sunday | Ensure data is never more than 7 days old |
| Manual dispatch | on demand | Force a refresh at any time from the GitHub Actions UI |

The workflow checks out the repository, installs Python 3.11 and the dependencies in requirements.txt, runs scripts/update_data.py, stages the changed files under data/latest/, data/changelog/, data/history/, and docs/data/, and commits and pushes only if there is a diff. The commit message is always chore(data): automated refresh, keeping the history clean and auditable.
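
The commit-only-if-there-is-a-diff step boils down to inspecting git status --porcelain output. A sketch of that decision in Python (the helper names are hypothetical; the actual workflow does this in shell):

```python
import subprocess

def has_changes(porcelain: str) -> bool:
    """True if `git status --porcelain` output reports any changed file."""
    return any(line.strip() for line in porcelain.splitlines())

def refresh_commit_needed() -> bool:
    """Check the data directories for uncommitted pipeline output."""
    out = subprocess.run(
        ["git", "status", "--porcelain", "data/latest/", "docs/data/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return has_changes(out)

# If True, the workflow effectively runs:
#   git commit -m "chore(data): automated refresh" && git push
```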

2. Deploy Pages (deploy-pages.yml)

Every push to main that touches any file under docs/ or the workflow file itself triggers this workflow. It uploads the docs/ folder as a GitHub Pages artifact and deploys it to https://mafiAtUN.github.io/country-classification-commons/. Concurrent runs are cancelled so only the latest version is ever deployed.

This means: when the data pipeline runs and finds changes, it commits → that commit triggers Pages deploy → within minutes the live website serves the updated data files.

Data freshness

Under normal conditions the data is refreshed up to 11 times per week (twice per weekday plus once on Sunday). The exact "last refreshed" timestamp is embedded in every output file via run_manifest.json and visible in the Explorer header. Each run also archives a full snapshot to data/history/<snapshot-id>/, so historical versions are always available in the repository.

Change tracking & reproducibility

Per-run changelog

Every run generates a Markdown report at data/changelog/changes_<snapshot_id>.md listing:

  • Countries/areas added or removed since the previous run
  • Countries where core metadata changed (income level, FCS status, ODA eligibility, etc.)
  • Group memberships added or removed
  • Any country names from external sources (FCS, OECD) that could not be mapped to ISO3
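
The changelog items above amount to a set/dict diff between the previous and current snapshots. A simplified sketch for the income-level case (the function and field names are illustrative, not the pipeline's actual code):

```python
def diff_income_levels(previous: dict, current: dict) -> dict:
    """Compare {iso3: income_level} maps between two runs and report
    additions, removals, and changed values."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = {
        iso3: (previous[iso3], current[iso3])
        for iso3 in previous.keys() & current.keys()
        if previous[iso3] != current[iso3]
    }
    return {"added": added, "removed": removed, "changed": changed}

prev = {"IND": "Lower middle income", "CHL": "High income"}
curr = {"IND": "Lower middle income", "CHL": "Upper middle income", "SSD": "Low income"}
report = diff_income_levels(prev, curr)
```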

SHA-256 checksums

The run_manifest.json file includes a SHA-256 hash for every output file. This lets you verify that the file you downloaded matches exactly what the pipeline produced, and detect any transmission or storage corruption.
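
Verifying a downloaded file against the manifest takes a few lines of standard-library Python. The manifest key layout shown here is an assumption; check run_manifest.json for the actual structure.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 to avoid loading it whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, manifest_path: Path) -> bool:
    """Compare a file's hash to its manifest entry (assumed key layout)."""
    manifest = json.loads(manifest_path.read_text())
    expected = manifest["checksums"][path.name]  # assumed layout
    return sha256_of(path) == expected
```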

Full history in git

Every automated commit is versioned in the repository's git history. You can check out any past commit to reproduce the exact dataset from that point in time. Snapshots are also copied to data/history/ for convenient access without git.

Limitations & caveats

Citation & license

This project is open source under the MIT License. The compiled dataset is derived from public data published by UNSD, World Bank, and OECD. Please also cite the original sources when using the data in publications:

Repository: github.com/MafiAtUN/country-classification-commons