TDA Method

Mapper

Drafting

What Is Mapper?

Think of the Mapper algorithm as making a transit map of your data. A physical transit map does not show every building and street; it schematises the city, showing major stations connected by lines, making visible structural relationships that a satellite photograph would obscure. Mapper does the same for high-dimensional data: it produces a network (graph) whose vertices are clusters of data points and whose edges represent cluster overlaps, preserving the topological “shape” of the data while discarding irrelevant detail.

The result is legible and interpretable: clusters become nodes, structural bridges become edges, and the global layout of the network reflects the global shape of the data manifold.

Mathematical Formulation

Given a data set XX with a lens function f:XRkf: X \to \mathbb{R}^k and a cover U={Ui}\mathcal{U} = \{U_i\} of f(X)f(X), the Mapper graph M(X,f,U,clust)M(X, f, \mathcal{U}, \text{clust}) is constructed as follows:

  1. For each UiUU_i \in \mathcal{U}, form the preimage f1(Ui)Xf^{-1}(U_i) \subseteq X
  2. Cluster f1(Ui)f^{-1}(U_i) using a clustering algorithm (typically single-linkage): produces clusters {Cij}\{C_{ij}\}
  3. Construct the nerve of the cover: vertices are clusters CijC_{ij}; edges connect CijC_{ij} and CklC_{kl} when CijCklC_{ij} \cap C_{kl} \neq \emptyset

The Mapper graph is the 1-skeleton of this nerve:

M=({Cij},  {{Cij,Ckl}:CijCkl})M = \left(\{C_{ij}\},\; \{\{C_{ij}, C_{kl}\} : C_{ij} \cap C_{kl} \neq \emptyset\}\right)

The choice of lens function ff determines what structure Mapper highlights. In this programme, ff is the 2nd-nearest-neighbour distance lens (projection="knn_distance_2"), which maps each point to its distance to its second closest neighbour and thereby highlights low-density regions and topological bridges in the trajectory space.

Python Code Stub

import kmapper as km
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import MinMaxScaler
from typing import Any, Dict

def run_mapper(
    X: np.ndarray,
    n_cubes: int = 15,
    overlap: float = 0.5,
) -> Dict[str, Any]:
    """Apply Mapper to a feature matrix X.

    Returns the graph dictionary produced by km.KeplerMapper.map(),
    with keys 'nodes' (dict of node_id -> list of point indices) and
    'links' (dict of node_id -> list of connected node_ids).
    """
    mapper = km.KeplerMapper(verbose=0)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Lens: 2nd nearest-neighbour distance (knn_distance_2), highlighting low-density bridges
    lens = mapper.fit_transform(X_scaled, projection="knn_distance_2")

    # Cover with n_cubes intervals and given overlap
    cover = km.Cover(n_cubes=n_cubes, perc_overlap=overlap)

    # Cluster within each cover element
    graph = mapper.map(
        lens,
        X_scaled,
        cover=cover,
        clusterer=AgglomerativeClustering(n_clusters=None, linkage='single', distance_threshold=0.5),
    )
    return graph

Application to the Research Programme

Mapper is used in Paper 2 to reveal the interior topological structure of UK employment trajectory space, identifying five robust topological clusters including a distinctive flare sub-population corresponding to early labour-market exit. Bootstrap stability analysis (1000 draws) confirms that the five-cluster topology is structurally stable across parameter choices.

The Mapper graph representation is also used in Paper 5 for cross-national comparison of employment trajectory topology and in Paper 7 as a node embedding feature for the geometric forecasting model.

Parameter sensitivity (number of intervals, overlap percentage, clustering threshold) is systematically assessed in each application; the stability analysis protocol is described in Paper 2’s supplementary materials.