Engineering notes

How the linked flow map and clustergram talk to each other, and the plumbing underneath: data pipeline, widget architecture, observable store, alpha-shape neighborhoods, and the standalone HTML builds.

1. Architecture at a glance

Two Jupyter widgets sit side-by-side in an HBox, both rendered via anywidget (so they ship as ES modules that work in JupyterLab, the standalone HTML exports, and anywhere ipywidgets renders):

┌──────────────────── HBox (flow ↔ clustergram) ─────────────────────┐
│                                                                    │
│  ┌─── BikeFlowMapWidget ────┐  ┌─── celldega.viz.Clustergram ───┐  │
│  │                          │  │                                │  │
│  │ deck.gl layers:          │  │ Matrix component:              │  │
│  │  ├─ basemap tiles        │  │  ├─ heatmap + dendrograms      │  │
│  │  ├─ alpha-shape polys    │  │  ├─ manual category bars       │  │
│  │  ├─ flow lines           │  │  └─ axis labels                │  │
│  │  └─ station scatter      │  │                                │  │
│  │                          │  │                                │  │
│  │ Observable store         │  │ Internal JS store              │  │
│  └──────┬───────────────────┘  └────────────────────────┬───────┘  │
│         │                                               │          │
│         │              jsdlink() in Python              │          │
│         │            (traitlets ↔ traitlets)            │          │
│         └────── click_info, selected_rows/cols, … ──────┘          │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Everything visible to the user lives in the browser: two JS bundles, two anywidget models, and a set of directional jsdlink links that mirror selected traitlets from one widget to the other, so each can react to the other without involving the kernel.

2. Data pipeline

Each bike-share operator (Citi Bike NYC, Bluebikes Boston, Capital Bikeshare DC, Divvy Chicago) publishes monthly trip CSVs to a public S3 bucket. The Python package (bike_network_traffic.data) does the fetch + transform:

Resolve a (city, year, month) tuple to one or more S3 URLs by parsing the bucket's index.html (just a list of <Key> entries). Month can be a single int, a list, or None for a whole year.
Download each monthly zip into ~/.cache/bike_network_traffic/, extract the trip CSV(s), and concatenate them. Downloads stream over HTTP; the per-chunk progress indicator is off by default so notebooks stay quiet.
Normalize schemas. Each operator renamed its columns around 2021 (start_station_id vs from_station_id, etc.); we map them to a single canonical layout.
Build a station metadata frame (station_id, station_name, lat, lng) and the destination-probability matrix: for each origin station s, the row is P(destination = d | origin = s), obtained by group-counting trip endpoints and normalizing.

That probability matrix is our feature table for Celldega. Rows are origin stations, columns are destination stations, values sum to 1 per row — the same shape you'd feed into a clustergram of expression data.
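
The group-count-and-normalize step can be sketched in plain Python, so it runs without pandas. The function name destination_probabilities and the (origin, destination) tuple input are illustrative assumptions, not the package's actual API; rows are keyed by origin, as in the description above.

```python
from collections import Counter, defaultdict

def destination_probabilities(trips):
    """Return {origin: {destination: P(destination | origin)}}.

    `trips` is an iterable of (origin, destination) station-name pairs.
    Hypothetical helper: the real package works on trip DataFrames.
    """
    counts = defaultdict(Counter)
    for origin, dest in trips:
        counts[origin][dest] += 1  # group-count trip endpoints

    probs = {}
    for origin, dest_counts in counts.items():
        total = sum(dest_counts.values())
        # Normalize so each origin's row sums to 1.
        probs[origin] = {d: n / total for d, n in dest_counts.items()}
    return probs

# Toy data: three trips out of A, one out of B.
toy_trips = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
toy_probs = destination_probabilities(toy_trips)
```

Each inner dict is one row of the feature table; a dense matrix would fill the missing destinations with zeros.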

Clustering & UMAP

viz.make_station_clustergram runs hierarchical clustering via Celldega's clust.Matrix to produce both a rendered Clustergram widget and a {station_name: cluster_id} map. That map is the join key that lets the map render each station with its cluster color and compute one alpha-shape per cluster.

Separately, the probability matrix goes through a small scanpy pipeline (anndata → neighbors → UMAP) to get a 2-D embedding of each origin station by its behavioral similarity to others. Stations that tend to send riders to the same destinations end up near each other in UMAP space even if they sit on opposite sides of the city. That embedding becomes the second coordinate system on the map (see §9).

3. The two widgets

BikeFlowMapWidget (ours)

src/bike_network_traffic/widget.py is a tiny anywidget.AnyWidget subclass. Its whole job is to declare a set of traitlets and point at a bundled ES module:

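import anywidget
import traitlets

# BUNDLE_PATH is a pathlib.Path defined elsewhere in the package.
class BikeFlowMapWidget(anywidget.AnyWidget):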
    _esm = BUNDLE_PATH / "widget.js"

    stations           = traitlets.List(default_value=[]).tag(sync=True)
    edge_index         = traitlets.Dict(default_value={}).tag(sync=True)
    click_info         = traitlets.Dict(default_value={}).tag(sync=True)
    selected_rows      = traitlets.List(default_value=[]).tag(sync=True)
    selected_cols      = traitlets.List(default_value=[]).tag(sync=True)
    matrix_axis_slice  = traitlets.Dict(default_value={}).tag(sync=True)
    palette_rgb        = traitlets.List(default_value=[]).tag(sync=True)
    spatial_mix        = traitlets.Float(0.0).tag(sync=True)
    cluster_polygons   = traitlets.Dict(default_value={}).tag(sync=True)
    alpha_index        = traitlets.Int(4).tag(sync=True)
    show_neighborhoods = traitlets.Bool(True).tag(sync=True)
    show_stations      = traitlets.Bool(True).tag(sync=True)
    # … plus cg_row_names / cg_col_names / matrix_slice_request_out, for jsdlink.

Every traitlet tagged sync=True is mirrored into the front-end model, and any JS-side model.set(…) followed by model.save_changes() is mirrored back to Python. The JS bundle (js/bike_flow_map_widget.mjs, built with esbuild into widget.js) does all the rendering and interaction. The Python class almost never has to touch that state; the one Python-side observer is on matrix_slice_request_out, and it forwards map click requests into Celldega's matrix_slice_request traitlet.

Celldega Clustergram

Celldega's viz.Clustergram is a much larger widget — heatmap, dendrograms, row/col manual category bars, row/col axis sliders. It emits a handful of Python-observable events that we care about, all via a single click_info traitlet that holds a {type, value} payload:

row_label / col_label — clicked a station axis label
mat_value — clicked a matrix cell (two stations + a probability)
row_dendro / col_dendro — selected a dendrogram branch
cat_value — clicked a manual category bar entry

Plus matrix_axis_slice, which returns a sorted top-k slice of the matrix for a given row or column — this is what we use to draw the flow lines when a station is clicked, without having to ship the whole probability matrix to the browser.

4. Clustergram → map linkage

Python never runs during a user interaction — all linkage is client-side. The wiring lives in one small helper:

from ipywidgets import jsdlink

def link_flow_to_clustergram(flow, cgm):
    jsdlink((cgm, "click_info"),         (flow, "click_info"))
    jsdlink((cgm, "selected_rows"),      (flow, "selected_rows"))
    jsdlink((cgm, "selected_cols"),      (flow, "selected_cols"))
    jsdlink((cgm, "matrix_axis_slice"),  (flow, "matrix_axis_slice"))
    jsdlink((cgm, "row_names"),          (flow, "cg_row_names"))
    jsdlink((cgm, "col_names"),          (flow, "cg_col_names"))
    jsdlink((flow, "matrix_slice_request_out"), (cgm, "matrix_slice_request"))

jsdlink is ipywidgets' “JavaScript-side directional link”: it registers a front-end listener that copies one widget's traitlet to another's without a kernel round-trip. When the Clustergram pushes a new click_info, our widget's change:click_info handler fires within the same animation frame.

Direction: overwhelmingly Clustergram → flow map. The one reverse channel is matrix_slice_request_out: clicking a station on the map asks the Clustergram to return a row-and-column slice for that station, which then comes back as matrix_axis_slice and drives the flow lines.

5. Observable store on the JS side

The map widget's frontend is a small hand-rolled reactive store: essentially a tree of one-line observables rather than React/Redux. Each slot has get(), set(newValue), and subscribe(fn); setters skip no-op updates (compared with ===), and subscribers are called with the current value as soon as they register (the default; pass { immediate: false } to opt out):

const Observable = (initialValue) => {
  let value = initialValue;
  const subscribers = new Set();
  return {
    get: () => value,
    set: (v) => { if (value === v) return; value = v; subscribers.forEach((fn) => fn(v)); },
    subscribe: (fn, { immediate = true } = {}) => {
      subscribers.add(fn);
      if (immediate) fn(value);
      return () => subscribers.delete(fn);
    },
  };
};

The store is created once per render() call and holds both mirrors of Python traitlets (stations, click_info, cluster_polygons, spatial_mix, alpha_index, …) and derived / pure-JS state (focus, highlights, edges, hovered_cluster, pinned_cluster).

All writes are batched into a single requestAnimationFrame via scheduleRender(), which runs a 3-step pipeline:

inputs synced — the Python-side traitlets have already been copied into the store by model.on("change:…") handlers.
compute derived state — computeStateFromInputs() inspects click_info.type and the axis-slice traitlets to decide what the map should show: focus station, a list of highlights, and the edges array. This is where a row_label click becomes “station N is the focus, 24 top-neighbor edges go in/out”.
prepare deck props — buildLayers() produces the layer array and then either boots a new Deck instance or calls deck.setProps(…) to update an existing one.

A small state machine (deck_check: {inputs, computed, layers}) gates the actual setProps call, so updates triggered by multiple traitlet changes in the same tick produce one redraw, not three.
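
The coalescing idea can be sketched independently of deck.gl. In the minimal Python stand-in below, the RenderScheduler class and its flush() method are hypothetical names, with flush() playing the role of the requestAnimationFrame callback. It only illustrates how several changes in one tick collapse into one redraw, not how the widget's deck_check state machine is implemented.

```python
class RenderScheduler:
    """Coalesce many state changes into one redraw per tick.

    Hypothetical stand-in for scheduleRender(): `flush()` plays the role
    of the requestAnimationFrame callback.
    """

    def __init__(self, redraw):
        self._redraw = redraw
        self._pending = False
        self._dirty = set()

    def schedule(self, reason):
        # Record what changed and mark a flush as pending.
        self._dirty.add(reason)
        self._pending = True

    def flush(self):
        if not self._pending:
            return False  # nothing changed since the last frame
        reasons, self._dirty = self._dirty, set()
        self._pending = False
        self._redraw(sorted(reasons))
        return True

redraws = []
scheduler = RenderScheduler(redraws.append)
# Three traitlet changes land in the same tick...
for name in ("click_info", "selected_rows", "matrix_axis_slice"):
    scheduler.schedule(name)
scheduler.flush()  # ...and produce a single redraw.
scheduler.flush()  # No-op: nothing is pending.
```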

6. deck.gl layer stack & picking

buildLayers() returns these, in order — later layers render on top:

basemap (TileLayer + BitmapLayer) — Carto light tiles. Its alpha lerps to 0 as spatial_mix → 1, so the basemap fades out when we morph into UMAP space (where geographic tiles would be nonsense).
alpha-shape polygons (PolygonLayer) — one ring per cluster, stroked + filled. Border carries the emphasis (width → 3.2 px and alpha → 245 when hovered or pinned); fill stays light so station dots inside stay readable. See §8.
flow lines (LineLayer) — the user's current edges array. Width is proportional to sqrt(probability); color depends on direction (red = outbound, blue = inbound, near-black = a single mat_value cell click so it reads against any basemap).
animated rides (ScatterplotLayer) — a swarm of tiny dots taking a Markov-chain random walk over the station network, colored by their current origin's cluster. Inserted just below the station scatter so stations stay clickable. Pool size scales with the city — roughly 2× the station count, capped at 5000 (so NYC gets ~4000, Chicago ~1800, DC ~1400, Boston ~800). On by default; the Rides button in the top bar toggles visibility. See §7 for the full pipeline.
station scatter (ScatterplotLayer) — one dot per station, radius in meters so zoom drives pixel size, fill color by cluster. Focus hub and linked stations get scaled up based on their share of flow.

Picking follows the draw order: station dots win over polygons, so clicking a station behaves the same whether or not a neighborhood is drawn underneath it. The polygon layer disables picking entirely once spatial_mix is past roughly 0.5 (i.e. the layer is almost invisible anyway) to stop hover events from registering on a ghost region.

7. Animated bike-ride simulation

With the static map in place it's easy to forget you're looking at a living system. By default the map drops a swarm of tiny dots onto the network — each one taking a random walk over the station graph in real time — and the Rides button in the top bar toggles them off if you want a static view. The whole simulation runs client-side (the kernel is never involved past the initial data push) and reacts instantly to focus changes and the spatial↔UMAP morph.

Data shipped from Python

Two new traitlets carry everything the JS sampler needs:

transition_topk = traitlets.Dict({}).tag(sync=True)
station_outflow = traitlets.Dict({}).tag(sync=True)

transition_topk: a sparse top-K (K=50) view of the destination-probability matrix. Format {origin_name: [[dest_name, weight], …]}. Built by viz.compute_transition_topk by walking each column of the column-normalized matrix (columns are origins) and keeping the K largest destinations with weight ≥ 1e-4. For NYC's ~2,000 stations × 50 neighbors that's well under 500 KB of JSON.
station_outflow: raw per-station trip counts straight from the trips DataFrame, computed by viz.compute_station_outflow(trips). Optional — only populated when the user passed get_bike_data(…, return_trips=True) through to make_flow_widget(…, trips=trips). Used to weight initial bike placement by real volume; when missing, the JS side falls back to an approximation (see below).
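
A rough sketch of the top-K truncation, assuming the probabilities are already available as a nested dict keyed by origin. The function compute_topk and its input layout are illustrative; this is not the body of viz.compute_transition_topk.

```python
import heapq

def compute_topk(probs, k=50, min_weight=1e-4):
    """Keep the k heaviest destinations per origin, dropping weights < min_weight.

    `probs` is {origin: {destination: weight}}; the result uses the
    {origin: [[dest, weight], ...]} layout described above.
    Hypothetical sketch, not viz.compute_transition_topk itself.
    """
    topk = {}
    for origin, row in probs.items():
        kept = [(d, w) for d, w in row.items() if w >= min_weight]
        best = heapq.nlargest(k, kept, key=lambda item: item[1])
        if best:
            topk[origin] = [[d, w] for d, w in best]
    return topk

toy = {"A": {"B": 0.6, "C": 0.39995, "D": 0.00005}, "B": {"A": 1.0}}
toy_topk = compute_topk(toy, k=2)
```

Nested lists rather than tuples keep the result directly JSON-serializable for a traitlets.Dict.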

The sampler

makeRideSampler(transitionTopk, stationOutflow) builds three things up front, all at sampler-creation time so each frame is just a few CDF lookups:

A per-origin cumulative distribution over its top-K destinations, stored as parallel dests[] / cum[] arrays so a single Math.random() * total + linear scan picks the next station in O(K).
A volume distribution w over origins, used both to seed the initial pool and to drive teleports (next bullet). When station_outflow is present, w = raw outflow counts — the ground truth for “where rides start”. When it's absent we approximate by power-iterating the chain's stationary distribution π = P·π for 30 steps over the sparse top-K kernel; that naturally concentrates mass on busy hubs but is slightly biased by the top-K truncation.
A teleport CDF derived from w, restricted to stations that also have an outgoing entry so a teleport never lands on a dead-end.
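
The first two pieces can be sketched as follows. This is a Python illustration of the idea, not a port of makeRideSampler: the helper names are made up, and the power iteration renormalizes each origin's truncated weights, which is one plausible way to handle the top-K truncation.

```python
import random

def build_cdf(pairs):
    """Turn [[name, weight], ...] into parallel names / cumulative-weight lists."""
    names, cum, total = [], [], 0.0
    for name, weight in pairs:
        total += weight
        names.append(name)
        cum.append(total)
    return names, cum

def sample_cdf(names, cum, rng):
    """Pick one name with a single uniform draw and an O(K) linear scan."""
    r = rng.random() * cum[-1]
    for name, c in zip(names, cum):
        if r < c:
            return name
    return names[-1]  # guard against floating-point edge cases

def approx_volume(topk, steps=30):
    """Power-iterate pi <- P pi over the sparse top-K kernel (fallback for w)."""
    stations = set(topk) | {d for row in topk.values() for d, _ in row}
    pi = {s: 1.0 / len(stations) for s in stations}
    for _ in range(steps):
        nxt = dict.fromkeys(stations, 0.0)
        for origin, row in topk.items():
            row_total = sum(w for _, w in row)
            for dest, w in row:
                nxt[dest] += pi[origin] * w / row_total
        norm = sum(nxt.values()) or 1.0  # mass lost at dead ends is renormalized
        pi = {s: v / norm for s, v in nxt.items()}
    return pi

toy_topk = {
    "A": [["B", 0.75], ["C", 0.25]],
    "B": [["A", 1.0]],
    "C": [["A", 0.5], ["B", 0.5]],
}
names, cum = build_cdf(toy_topk["A"])
picked = sample_cdf(names, cum, random.Random(0))
pi = approx_volume(toy_topk)
```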

Each Markov step draws a uniform random number first: with probability RIDE_TELEPORT_PROB = 0.05 the rider ignores the local distribution and teleports to a station drawn from the volume distribution. This is a PageRank-style move, and it loosely mimics the truck-and-van rebalancing that bike-share operators do every night. Without it, walkers can spend many steps shuffling between a handful of nearby outer-borough stations, because the column-normalized transition_prob has lost absolute trip volume.
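
A minimal sketch of one step with teleportation. The function names and the dict-based inputs are assumptions for illustration. It also assumes that a rider stranded on a station with no outgoing entries teleports, which the notes do not spell out.

```python
import random

RIDE_TELEPORT_PROB = 0.05  # value quoted above

def weighted_pick(pairs, rng):
    """Pick a name from [(name, weight), ...] by linear scan over running totals."""
    total = sum(w for _, w in pairs)
    r = rng.random() * total
    acc = 0.0
    for name, w in pairs:
        acc += w
        if r < acc:
            return name
    return pairs[-1][0]

def markov_step(current, topk, volume, rng, teleport_prob=RIDE_TELEPORT_PROB):
    """Return the next station: usually a local hop, occasionally a teleport."""
    # Teleport targets must have outgoing entries so a rider never gets stuck.
    teleport_targets = [(s, w) for s, w in volume.items() if topk.get(s)]
    # Assumption: a rider with no outgoing entries also teleports.
    if rng.random() < teleport_prob or not topk.get(current):
        return weighted_pick(teleport_targets, rng)
    return weighted_pick([(d, w) for d, w in topk[current]], rng)

toy_topk = {"A": [["B", 1.0]], "B": [["A", 1.0]], "Hub": [["A", 1.0]]}
toy_volume = {"A": 10, "B": 10, "Hub": 500, "DeadEnd": 900}
rng = random.Random(42)
always_local = markov_step("A", toy_topk, toy_volume, rng, teleport_prob=0.0)
always_jump = {
    markov_step("A", toy_topk, toy_volume, rng, teleport_prob=1.0)
    for _ in range(200)
}
```

With teleport_prob forced to 1.0 the draws concentrate on the high-volume hub and never land on the dead-end station, which is the behavior the teleport CDF is meant to guarantee.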

The pool and the loop

The simulation is a ridesPool of ride objects, each:

{
  from_name, to_name,        // current segment
  t,                          // progress in [0, 1]
  duration,                   // ms; scales with great-circle distance + jitter
  color,                      // RGB tinted by from-cluster
  position,                   // [lng, lat], lerped each frame
  focused?,                   // see "Focused modes" below
}

A dedicated requestAnimationFrame loop (ridesTick) advances every ride by the inter-frame Δt and recomputes its position by linear interpolation in the current posLookup. That last detail is what lets rides follow the spatial↔UMAP morph: the same dot slides smoothly from its geographic-space lerp to its UMAP-space lerp as the slider moves, without the ride object knowing about either coordinate system.

When t ≥ 1 the ride takes a Markov step: from_name becomes the old to_name, sampleNext() picks a new destination, and duration is recomputed from the new geography. Overshoot (t - 1) is carried into the new segment so steady-state isn't biased by snap-to-zero on each hop. Color re-tints to the new origin's cluster so dots visibly pick up local color as they cross neighborhood boundaries.
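
The advance-and-hop logic, including the overshoot carry, can be sketched like this. The helpers advance_ride and lerp_position are hypothetical Python stand-ins for the JS loop; sample_next and segment_duration are supplied by the caller, and color re-tinting is left out.

```python
def advance_ride(ride, dt_ms, sample_next, segment_duration):
    """Advance one ride by dt_ms; on arrival, hop and carry the overshoot.

    `sample_next(name)` and `segment_duration(a, b)` are caller-supplied
    callables (hypothetical stand-ins for the JS helpers).
    """
    ride["t"] += dt_ms / ride["duration"]
    if ride["t"] >= 1.0:
        overshoot = ride["t"] - 1.0
        ride["from_name"] = ride["to_name"]
        ride["to_name"] = sample_next(ride["from_name"])
        ride["duration"] = segment_duration(ride["from_name"], ride["to_name"])
        # Carry the overshoot instead of snapping to 0; clamp so a long
        # frame cannot skip a whole segment in this simplified version.
        ride["t"] = min(overshoot, 0.999)
    return ride

def lerp_position(ride, pos_lookup):
    """Interpolate [lng, lat] between the segment endpoints in pos_lookup."""
    (x0, y0), (x1, y1) = pos_lookup[ride["from_name"]], pos_lookup[ride["to_name"]]
    t = ride["t"]
    return [x0 + (x1 - x0) * t, y0 + (y1 - y0) * t]

ride = {"from_name": "A", "to_name": "B", "t": 0.9, "duration": 1000.0}
next_station = {"A": "B", "B": "C", "C": "A"}
advance_ride(ride, 200.0, next_station.get, lambda a, b: 1000.0)
pos = lerp_position(ride, {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [1.0, 2.0]})
```

Because lerp_position reads only from the lookup it is given, swapping in a morphed lookup moves the dot without touching the ride itself.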

Decoupling from the main render pipeline

The first version of this hooked the rides into scheduleRender(), and immediately killed hover/click responsiveness: a 60 Hz rAF was forcing the whole layer stack (basemap + polygons + flow lines + stations) to rebuild every frame, starving picking events. The fix splits the pipeline:

buildLayers() still produces the four non-rides layers, and caches them in cachedNonRidesLayers. It also refreshes a small ridesCtx object (posLookup, palette, cluster lookup, spatial mix, focus, focusCtx).
ridesTick never re-runs computeStateFromInputs() or rebuilds non-rides layers. Each frame it advances the pool, builds only the rides ScatterplotLayer, splices it into cachedNonRidesLayers just below the station scatter, and pushes the result via deck.setProps({ layers }). Hover and click handlers run against stable layer instances, so picking stays smooth even at 60 fps.

The ScatterplotLayer uses an integer ridesFrame counter as its updateTriggers.getPosition, so deck.gl re-runs the position accessor every frame while leaving the rest of the layer's GPU state alone.

Focused modes

Any selection in the UI — a station click on the map, a row/col label or dendrogram pick on the clustergram, or a single matrix cell — switches the rides simulator out of the ambient Markov walk into a one-shot mode tailored to that selection. computeStateFromInputs tags each selection with a kind ('station', 'mat_cell', 'col_dendro', 'row_dendro', 'cat_value') and buildFocusedRideContext(...) dispatches on it:

station — in/out CDFs over the focused station's top neighbours, drawn straight from store.edges (so the rides match exactly what the line layer is showing). Direction is rolled per ride against total in vs out mass, so a station with mostly inbound trips visually gets mostly inbound animations.
mat_cell — a single fixed (origin → destination) pair. Every ride is the same trip; useful for “watch how often this specific flow fires” once you've clicked a cell in the clustergram heatmap.
col_dendro / cat_value — outgoing rides from any station in the dendrogram or category selection. Per ride: pick an origin from the selection (weighted by station_outflow when shipped, else uniform), then sample its destination from transition_topk.
row_dendro — incoming rides to any station in the selection. We invert transition_topk over the selected destinations once at context-construction time to build a per-destination incoming CDF, then per ride: pick a destination (weighted by total inflow into the selection), then pick an origin from that destination's incoming CDF.
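
The row_dendro inversion is the least obvious of these, so here is a small sketch under stated assumptions. The helper names are invented, and the uniform draws are passed in explicitly to keep the example deterministic.

```python
def build_incoming(topk, selected_destinations):
    """Invert {origin: [[dest, w], ...]} into per-destination incoming lists.

    Returns (incoming, inflow): incoming[dest] = [(origin, w), ...] and
    inflow[dest] = total weight arriving at dest, for selected dests only.
    """
    selected = set(selected_destinations)
    incoming = {d: [] for d in selected}
    for origin, row in topk.items():
        for dest, w in row:
            if dest in selected:
                incoming[dest].append((origin, w))
    inflow = {d: sum(w for _, w in pairs) for d, pairs in incoming.items()}
    return incoming, inflow

def pick(pairs, u):
    """Pick from [(name, weight), ...] given a uniform draw u in [0, 1)."""
    total = sum(w for _, w in pairs)
    acc = 0.0
    for name, w in pairs:
        acc += w
        if u * total < acc:
            return name
    return pairs[-1][0]

def sample_incoming_ride(incoming, inflow, u_dest, u_origin):
    """Pick a destination by total inflow, then an origin from its incoming list."""
    dest = pick(sorted(inflow.items()), u_dest)
    origin = pick(incoming[dest], u_origin)
    return origin, dest

toy_topk = {"A": [["X", 0.8], ["Y", 0.2]], "B": [["X", 0.1], ["Z", 0.9]]}
incoming, inflow = build_incoming(toy_topk, ["X", "Y"])
ride = sample_incoming_ride(incoming, inflow, u_dest=0.5, u_origin=0.95)
```

The inversion runs once per selection; each ride then costs two weighted picks.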

Each focused ride is tagged focused: true; advanceRides checks the flag and retires the ride on segment completion (pool[i] = null) instead of taking another Markov step. refillRides immediately spawns a replacement — the pool stays full but chain continuation is disabled in every focused mode.

Pool size is computed by targetRidesPoolSize(nRides, kind, focusCtx, totalStations) with three regimes. Narrow selections (single station, matrix cell) return max(50, round(nRides × 0.10)) — a one-station view doesn't need thousands of walkers and the dense swarm would obscure the in/out lines. Dendrogram & category selections scale linearly by selection_size / total_stations, floored at 50: per-station ride density stays constant whether you click a 30-station cluster or a 300-station one. Ambient (no focus) returns the full slider value. The per-city slider max is about 10 × num_stations (clamped 1k–20k); the initial thumb is the linear midpoint of that range. When the station list first arrives, if n_rides was only clamped to the placeholder cap (1k before num_stations was known), it is reset to that midpoint so the control doesn't stay pegged near the minimum. With the default and ~2000 NYC stations (max 20k, default ~10k), a 200-station cluster animates ∼1000 walkers, a 30-station cluster ∼150.
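
The three regimes reduce to a few lines. The sketch below uses a simplified, hypothetical signature (the real function also takes focusCtx) and reproduces the NYC figures quoted above.

```python
import math

FOCUSED_FLOOR = 50      # floor quoted above
NARROW_FRACTION = 0.10  # single station / matrix cell

def js_round(x):
    """Round half up, like JavaScript's Math.round for non-negative x."""
    return math.floor(x + 0.5)

def target_pool_size(n_rides, kind, selection_size=0, total_stations=1):
    """Sketch of the three regimes; signature is simplified and hypothetical."""
    if kind in ("station", "mat_cell"):
        return max(FOCUSED_FLOOR, js_round(n_rides * NARROW_FRACTION))
    if kind in ("col_dendro", "row_dendro", "cat_value"):
        share = selection_size / max(total_stations, 1)
        return max(FOCUSED_FLOOR, js_round(n_rides * share))
    return n_rides  # ambient: full slider value

nyc_default = 10_000
sizes = {
    "ambient": target_pool_size(nyc_default, None),
    "station": target_pool_size(nyc_default, "station"),
    "cluster_200": target_pool_size(nyc_default, "col_dendro", 200, 2000),
    "cluster_30": target_pool_size(nyc_default, "row_dendro", 30, 2000),
    "cluster_5": target_pool_size(nyc_default, "cat_value", 5, 2000),
}
```

The five-station case shows the floor: a proportional share of 25 is raised to 50 so a tiny selection still shows visible motion.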

Dendro and category selections also opt in to a thin dark-grey stroke around each ride (stroked: true, ~0.8 px line width) because the cluster palette includes light yellows / mints that disappear against the basemap when rendered as flat dots.

Switching between selections (or back to ambient) flushes the pool via flushRidesForSelectionChange, which keys off a signature built from the kind plus the relevant identifiers (focused station name, mat-cell pair, dendro selection set). Same selection reissued under a new linkInteractionSeq doesn't flicker; an actual change immediately swaps the swarm.

Tuning knobs

Pool size — computed per-city by targetRidesPoolSize(numStations, focused). Ambient is clamp(round(RIDES_PER_STATION × numStations), 400, 5000) — NYC ~4000, Chicago ~1800, DC ~1400, Boston ~800. Focused (any station / cell / dendro / cat selection) drops to RIDES_FOCUSED_FRACTION = 0.10 of ambient, floored at RIDES_FOCUSED_MIN = 50. resizeRidesPool grows or trims the pool in place on every selection change. Higher RIDES_PER_STATION feels denser but costs CPU on the per-frame interpolation loop — the cap of 5000 keeps even NYC under a few ms per frame.
RIDE_TELEPORT_PROB = 0.05 — balance between local realism (low) and rebalancing-toward-hubs (high). 0.15 felt too jumpy; 0.0 traps walkers in outer-borough cycles.
RIDE_MS_PER_DEG: constant velocity, expressed as milliseconds per degree of (lng, lat) distance. We use a constant-velocity model so every dot moves at the same on-screen speed regardless of hop length: a short adjacent-station hop finishes quickly, a cross-town hop takes proportionally longer, and the visual flow speed stays consistent across the whole map. RIDE_SEG_VEL_JITTER applies a small (±18%) multiplicative jitter on the velocity so dots in the same wave don't arrive in lockstep, and RIDE_SEG_MIN_MS guards against zero-length segments (co-located stations) becoming instantaneous.
Dot radius (getRadius: 1.3, min 0.8 px / max 2.2 px) — small enough that 1000 simultaneous rides don't overwhelm the station scatter.
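
To make the constant-velocity model concrete, here is a sketch of the per-segment duration. The numeric values for RIDE_MS_PER_DEG and RIDE_SEG_MIN_MS are placeholders chosen for the example (the notes above don't state them); only the ±18% jitter comes from the text.

```python
import math
import random

# Illustrative values only: the notes do not state the real constants.
RIDE_MS_PER_DEG = 400_000.0
RIDE_SEG_VEL_JITTER = 0.18
RIDE_SEG_MIN_MS = 250.0

def segment_duration_ms(a, b, rng):
    """Constant-velocity duration for a hop between [lng, lat] points a and b."""
    dist_deg = math.hypot(b[0] - a[0], b[1] - a[1])
    # Jitter multiplies the velocity, so it divides the duration.
    velocity_scale = 1.0 + rng.uniform(-RIDE_SEG_VEL_JITTER, RIDE_SEG_VEL_JITTER)
    return max(RIDE_SEG_MIN_MS, dist_deg * RIDE_MS_PER_DEG / velocity_scale)

rng = random.Random(7)
short_hop = segment_duration_ms([-73.99, 40.73], [-73.985, 40.73], rng)
long_hop = segment_duration_ms([-73.99, 40.73], [-73.94, 40.80], rng)
same_spot = segment_duration_ms([-73.99, 40.73], [-73.99, 40.73], rng)
```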

8. Alpha-shape neighborhoods

The polygons come from celldega.nbhd.alpha_shape, which is a thin wrapper over libpysal's alpha-shape implementation. Input is a point cloud; output is a MultiPolygon that approximates the “tight concave hull” of those points, parameterized by an inverse-radius α (larger α, i.e. a smaller radius, gives more filigree; smaller α gets closer to the convex hull).

Projection

libpysal uses Euclidean distance on the input coordinates, so if you pass raw (lng, lat) you're implicitly working in degrees, which is useless for a “radius in miles”. nbhd.compute_cluster_alpha_shapes runs each city's stations through a local equirectangular projection centered on the mean station, converting to meters; runs the alpha shape with α⁻¹ = r × 1609.34 (r in miles, 1609.34 m per mile); and uses a rounded-meter lookup table to map output vertices back to both their original (lng, lat) and their UMAP coords.
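
A sketch of the projection and the α conversion, stopping short of the libpysal call. The function names and the spherical Earth radius are assumptions for illustration, not the contents of nbhd.compute_cluster_alpha_shapes.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean spherical radius (assumed)
METERS_PER_MILE = 1609.34     # conversion factor used above

def project_local(points):
    """Project [(lng, lat), ...] to meters around the mean point.

    Local equirectangular approximation: fine at city scale, not globally.
    """
    lng0 = sum(p[0] for p in points) / len(points)
    lat0 = sum(p[1] for p in points) / len(points)
    cos_lat0 = math.cos(math.radians(lat0))
    return [
        (
            EARTH_RADIUS_M * math.radians(lng - lng0) * cos_lat0,
            EARTH_RADIUS_M * math.radians(lat - lat0),
        )
        for lng, lat in points
    ]

def alpha_for_radius_miles(r_miles):
    """Inverse-radius alpha for a radius in miles (1 / alpha = r * 1609.34)."""
    return 1.0 / (r_miles * METERS_PER_MILE)

stations = [(-73.99, 40.70), (-73.99, 40.80), (-73.95, 40.75)]
xy = project_local(stations)
north_south_m = xy[1][1] - xy[0][1]
alpha = alpha_for_radius_miles(0.5)
```

The projected points and alpha are what would then be handed to the alpha-shape routine; keeping a map from rounded projected coordinates back to station records is what lets output vertices recover their (lng, lat) and UMAP positions.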

Precomputation

Ten log-spaced radii from 0.05 mi (block scale) to 5 mi (cross-borough) are precomputed per cluster at notebook-build time. The payload is a single dict:

{
  "levels_miles": [0.05, 0.083, 0.139, …, 5.0],
  "polygons": [
    {"cluster_id": 1, "by_level": [
      [{"geo": [[lng,lat],…], "umap": [[lng,lat],…]}, …],   # level 0
      …
    ]},
    …
  ]
}

Coords are rounded to 5 decimals (≈1 m precision). For a typical city (150 clusters, avg 15-vertex polygons, 10 levels) that's well under 200 KB of JSON, which embeds comfortably into the standalone widget HTMLs. A binary-buffer encoding via anywidget is a drop-in replacement if we ever need to go further.
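
For reference, the radius ladder and the coordinate rounding can be reproduced with the standard library alone. The helper names below are illustrative, and the sketch assumes "log-spaced" means a constant ratio between consecutive levels.

```python
def log_spaced(start, stop, count):
    """Return `count` geometrically spaced values from start to stop inclusive."""
    ratio = (stop / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]

levels_miles = [round(r, 3) for r in log_spaced(0.05, 5.0, 10)]
# Each step multiplies the radius by the same factor (about 1.67 here).
step_ratio = (5.0 / 0.05) ** (1.0 / 9)

def round_ring(ring, ndigits=5):
    """Round [[lng, lat], ...] to 5 decimals, roughly 1 m on the ground."""
    return [[round(x, ndigits), round(y, ndigits)] for x, y in ring]

ring = round_ring([[-73.9876543, 40.7654321], [-73.9812345, 40.7601234]])
```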

Flicker fix: layer id per resolution

The NBHD slider is a pure front-end control — it just changes store.alpha_index. Originally the PolygonLayer had a constant id and relied on updateTriggers to swap the getPolygon accessor. Unfortunately that tells deck.gl “the same layer, new geometry” — which makes it try to interpolate attribute transitions across vertex buffers of wildly different sizes, causing a flicker at every slider step. Fix: include alphaIdx in the layer id. A resolution change is now a full layer swap; the transitions we do want (hover color/width) still run smoothly within a single resolution.

9. Spatial ↔ UMAP morph

Each station ships with both a (lng, lat) pair and a (umap_lng, umap_lat) pair (centered on the geographic centroid so the two frames overlay). A single spatial_mix float in [0, 1] drives everything:

// in buildLayers()
const t = spatialMix;
for (const s of stations) {
  const lng = Number(s.lng) * (1 - t) + (Number(s.umap_lng) + dLng) * t;
  const lat = Number(s.lat) * (1 - t) + (Number(s.umap_lat) + dLat) * t;
  posLookup[s.name] = [lng, lat];
}

Flow lines read the same posLookup so endpoints follow. Alpha-shape polygons morph vertex-by-vertex using the geo↔UMAP lookup we built at precomputation time — no alpha shape is ever computed in UMAP space, we just drag the existing vertices along with their stations. The polygon layer's alpha fades to 0 by spatial_mix = 1 because a geo alpha shape loses meaning once it's warped.
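
The vertex-by-vertex morph is the same interpolation applied to each ring. In this sketch the UMAP ring is assumed to be already offset into the geographic frame, and the linear alpha fade is an assumption: the notes only say the alpha reaches 0 by spatial_mix = 1.

```python
def morph_ring(geo_ring, umap_ring, t):
    """Interpolate each vertex between its geographic and UMAP position."""
    return [
        [gx * (1 - t) + ux * t, gy * (1 - t) + uy * t]
        for (gx, gy), (ux, uy) in zip(geo_ring, umap_ring)
    ]

def polygon_alpha(base_alpha, t):
    """Fade linearly to 0 at t = 1 (the linear ramp is an assumption)."""
    return round(base_alpha * (1 - t))

geo = [[-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8]]
umap = [[-73.8, 40.9], [-73.7, 40.9], [-73.7, 41.0]]
halfway = morph_ring(geo, umap, 0.5)
```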

The basemap likewise fades out, and the map area's CSS background lerps from a tile-matching grey to pure white so the UMAP view ends up on a clean canvas.

10. Building the static HTMLs

Two notebook-driven pipelines produce the artifacts shipped alongside index.html:

build_htmls.ipynb runs each city notebook via nbconvert's ExecutePreprocessor (nbconvert.preprocessors.ExecutePreprocessor), then exports to a full <City>.html (code cells + outputs).
build_widget_htmls.ipynb generates a minimal per-city widget shell by programmatically building a single-code-cell notebook that invokes make_flow_widget + link_flow_to_clustergram, runs it, then exports via HTMLExporter with exclude_input=True. The result is a self-contained <City>_widget.html with just the HBox of the flow map + clustergram; that's what the iframe on the landing page loads.

We initially tried ipywidgets.embed.embed_minimal_html and hit an incompatibility with anywidget's ES-module bundling (embed-amd.js couldn't locate anywidget.js on disk). The nbconvert route bundles anywidget's JS as a data URI into the notebook's output HTML, so there's no runtime fetch and the page works from file:// URLs.

11. Known limits & future work

No Clustergram hover linkage. Celldega's Clustergram keeps hover state (hovered_cat) purely in its own JS store; it never pushes to a Python traitlet. Only clicks make it across. Adding a hover traitlet upstream is an obvious improvement.
Alpha shapes are computed per cluster, not jointly. Overlapping clusters produce overlapping polygons. A constrained Voronoi or nested containment scheme would clean that up.
UMAP is recomputed at build time. The embedding's random initialization means rebuilds can subtly reshuffle the UMAP layout; we'd want a fixed seed or a saved embedding to stabilize publication-quality outputs.
Flow lines cap at top-30. The map asks the Clustergram for a top-30 slice on station click (MAP_STATION_SLICE_TOP_K) and lets the simulated rides carry the long-tail signal visually — bumping the cap higher quickly turns Manhattan hubs into spaghetti. An alternate ArcLayer rendering for arched cross-town flows, or a “show all” mode that fades sub-pixel edges by probability, would let us push the cap up without sacrificing readability.
Widget-only HTML is ~1.5 MB per city, dominated by the deck.gl + anywidget bundles. Moving to a shared CDN import, or lazy-loading the map/cgm bundles, would trim it significantly.