Engineering notes
How the linked flow map and clustergram talk to each other, and the plumbing underneath: data pipeline, widget architecture, observable store, alpha-shape neighborhoods, and the standalone HTML builds.
1. Architecture at a glance
Two Jupyter widgets sit side-by-side in an HBox, both rendered via
anywidget (so they ship as ES modules that work in
JupyterLab, the standalone HTML exports, and anywhere ipywidgets renders).
Everything visible to the user lives in the browser: two JS bundles, two anywidget
models, and a client-side directional link, jsdlink, that copies
selected traitlets from one widget to the other so each can react to the other without
involving the kernel.
2. Data pipeline
Each bike-share operator (Citi Bike NYC, Bluebikes Boston, Capital Bikeshare DC,
Divvy Chicago) publishes monthly trip CSVs to a public S3 bucket. The Python
package (bike_network_traffic.data) does the fetch + transform:
- Resolve a (city, year, month) tuple to one or more S3 URLs by parsing the bucket's index.html (just a list of <Key> entries). Month can be a single int, a list, or None for a whole year.
- Download each monthly zip into ~/.cache/bike_network_traffic/, extract the trip CSV(s), and concatenate them. Streaming HTTP with a per-chunk progress indicator is off by default so notebooks stay quiet.
- Normalize schemas. Each operator went through a rename in 2021-ish (start_station_id vs from_station_id, etc.); we map them to a single canonical layout.
- Build a station metadata frame (station_id, station_name, lat, lng) and the destination-probability matrix: for each origin station s, the row is P(destination = d | origin = s), obtained by group-counting trip endpoints and normalizing.
That probability matrix is our feature table for Celldega. Rows are origin stations, columns are destination stations, values sum to 1 per row — the same shape you'd feed into a clustergram of expression data.
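The group-count-and-normalize step can be sketched in a few lines. This is a minimal illustration, assuming trips arrive as (origin, destination) name pairs; destination_probabilities is a hypothetical helper, not the package's API, and the real pipeline presumably works on DataFrames.

```python
from collections import Counter, defaultdict

def destination_probabilities(trips):
    """Return {origin: {destination: P(destination | origin)}}.

    `trips` is an iterable of (origin, destination) station names.
    Hypothetical helper, for illustration only.
    """
    counts = defaultdict(Counter)
    for origin, dest in trips:
        counts[origin][dest] += 1          # group-count trip endpoints
    matrix = {}
    for origin, dest_counts in counts.items():
        total = sum(dest_counts.values())  # all trips leaving this origin
        matrix[origin] = {d: n / total for d, n in dest_counts.items()}
    return matrix

trips = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
probs = destination_probabilities(trips)
```

Each origin's entries sum to 1, which is the per-origin normalization described above.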
Clustering & UMAP
viz.make_station_clustergram runs hierarchical clustering via
Celldega's clust.Matrix to produce both a rendered
Clustergram widget and a {station_name: cluster_id} map.
That map is the join key that lets the map render each station with its cluster
color and compute one alpha-shape per cluster.
Separately, the probability matrix goes through a small scanpy pipeline
(anndata → neighbors → UMAP) to get a 2-D embedding of each
origin station by its behavioral similarity to others. Stations that tend
to send riders to the same destinations end up near each other in UMAP space even
if they sit on opposite sides of the city. That embedding becomes the second
coordinate system on the map (see §9).
3. The two widgets
BikeFlowMapWidget (ours)
src/bike_network_traffic/widget.py is a tiny anywidget.AnyWidget
subclass. Its whole job is to declare a set of traitlets and point at
a bundled ES module:
class BikeFlowMapWidget(anywidget.AnyWidget):
    _esm = BUNDLE_PATH / "widget.js"
    stations = traitlets.List(default_value=[]).tag(sync=True)
    edge_index = traitlets.Dict(default_value={}).tag(sync=True)
    click_info = traitlets.Dict(default_value={}).tag(sync=True)
    selected_rows = traitlets.List(default_value=[]).tag(sync=True)
    selected_cols = traitlets.List(default_value=[]).tag(sync=True)
    matrix_axis_slice = traitlets.Dict(default_value={}).tag(sync=True)
    palette_rgb = traitlets.List(default_value=[]).tag(sync=True)
    spatial_mix = traitlets.Float(0.0).tag(sync=True)
    cluster_polygons = traitlets.Dict(default_value={}).tag(sync=True)
    alpha_index = traitlets.Int(4).tag(sync=True)
    show_neighborhoods = traitlets.Bool(True).tag(sync=True)
    show_stations = traitlets.Bool(True).tag(sync=True)
    # … plus cg_row_names / cg_col_names / matrix_slice_request_out, for jsdlink.
Every traitlet tagged with sync=True is mirrored into the front-end model, and any
JS-side model.set(…) call mirrors right back to Python. The JS
bundle (js/bike_flow_map_widget.mjs, built with esbuild into
widget.js) does all the rendering and interaction. The Python class
almost never has to touch that state. The one outbound channel is
matrix_slice_request_out, which forwards map click requests into
Celldega's matrix_slice_request traitlet (wired with jsdlink, see §4).
Celldega Clustergram
Celldega's viz.Clustergram is a much larger widget — heatmap,
dendrograms, row/col manual category bars, row/col axis sliders. It emits a handful
of Python-observable events that we care about, all via a single
click_info traitlet that holds a {type, value} payload:
- row_label / col_label: clicked a station axis label
- mat_value: clicked a matrix cell (two stations + a probability)
- row_dendro / col_dendro: selected a dendrogram branch
- cat_value: clicked a manual category bar entry
Plus matrix_axis_slice, which returns a sorted top-k slice of
the matrix for a given row or column — this is what we use to draw the flow lines
when a station is clicked, without having to ship the whole probability matrix to
the browser.
4. Clustergram → map linkage
Python never runs during a user interaction — all linkage is client-side. The wiring lives in one small helper:
from ipywidgets import jsdlink

def link_flow_to_clustergram(flow, cgm):
    # Clustergram -> flow map: clicks, selections, and the returned slice.
    jsdlink((cgm, "click_info"), (flow, "click_info"))
    jsdlink((cgm, "selected_rows"), (flow, "selected_rows"))
    jsdlink((cgm, "selected_cols"), (flow, "selected_cols"))
    jsdlink((cgm, "matrix_axis_slice"), (flow, "matrix_axis_slice"))
    jsdlink((cgm, "row_names"), (flow, "cg_row_names"))
    jsdlink((cgm, "col_names"), (flow, "cg_col_names"))
    # Flow map -> Clustergram: the one reverse channel (slice requests).
    jsdlink((flow, "matrix_slice_request_out"), (cgm, "matrix_slice_request"))
jsdlink is ipywidgets' “JavaScript-side directional link”: it
registers a front-end listener that copies one widget's traitlet to another's
without a kernel round-trip. When the Clustergram pushes a new
click_info, our widget's change:click_info handler fires
within the same animation frame.
Direction: overwhelmingly Clustergram → flow map. The one reverse
channel is matrix_slice_request_out: clicking a station on the map
asks the Clustergram to return a row-and-column slice for that station, which then
comes back as matrix_axis_slice and drives the flow lines.
5. Observable store on the JS side
The map widget's frontend is a small hand-rolled reactive store: essentially
a tree of one-line observables rather than React/Redux. Each slot has
get(), set(newValue), and subscribe(fn); setters
skip no-op updates, and subscribers receive the current value as soon as they register
(on by default; pass { immediate: false } to opt out):
const Observable = (initialValue) => {
  let value = initialValue;
  const subscribers = new Set();
  return {
    get: () => value,
    set: (v) => { if (value === v) return; value = v; subscribers.forEach((fn) => fn(v)); },
    subscribe: (fn, { immediate = true } = {}) => {
      subscribers.add(fn);
      if (immediate) fn(value);
      return () => subscribers.delete(fn);
    },
  };
};
The store is created once per render() call and holds both
mirrors of Python traitlets (stations, click_info,
cluster_polygons, spatial_mix, alpha_index, …)
and derived / pure-JS state (focus, highlights,
edges, hovered_cluster, pinned_cluster).
All writes are batched into a single requestAnimationFrame via
scheduleRender(), which runs a 3-step pipeline:
- inputs synced: the Python-side traitlets have already been copied into the store by model.on("change:…") handlers.
- compute derived state: computeStateFromInputs() inspects click_info.type and the axis-slice traitlets to decide what the map should show: focus station, a list of highlights, and the edges array. This is where a row_label click becomes “station N is the focus, 24 top-neighbor edges go in/out”.
- prepare deck props: buildLayers() produces the layer array and then either boots a new Deck instance or calls deck.setProps(…) to update an existing one.
A small state machine (deck_check: {inputs, computed, layers}) gates the
actual setProps call, so updates triggered by multiple traitlet changes
in the same tick produce one redraw, not three.
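The coalescing idea is easy to model outside the browser. The sketch below is a plain dirty-flag scheduler in Python, not the deck_check state machine itself; RenderScheduler and flush() are hypothetical names, with flush() standing in for the requestAnimationFrame callback.

```python
class RenderScheduler:
    """Coalesce many store writes into one render per frame (illustrative model)."""

    def __init__(self, render):
        self._render = render
        self._pending = False   # True once a frame has been requested

    def schedule(self):
        # Called on every store write; repeated calls before the frame are free.
        self._pending = True

    def flush(self):
        # Stand-in for the requestAnimationFrame callback.
        if not self._pending:
            return False
        self._pending = False
        self._render()
        return True

calls = []
scheduler = RenderScheduler(lambda: calls.append("render"))
for _ in range(3):      # three traitlet changes land in the same tick
    scheduler.schedule()
scheduler.flush()       # the frame fires once
scheduler.flush()       # nothing pending, so no extra render
```

Three writes followed by one frame leave exactly one entry in calls.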
6. deck.gl layer stack & picking
buildLayers() returns these, in order — later layers render on top:
- basemap (TileLayer + BitmapLayer): Carto light tiles. Its alpha lerps to 0 as spatial_mix → 1, so the basemap fades out when we morph into UMAP space (where geographic tiles would be nonsense).
- alpha-shape polygons (PolygonLayer): one ring per cluster, stroked + filled. Border carries the emphasis (width → 3.2 px and alpha → 245 when hovered or pinned); fill stays light so station dots inside stay readable. See §8.
- flow lines (LineLayer): the user's current edges array. Width is proportional to sqrt(probability); color depends on direction (red = outbound, blue = inbound, near-black = a single mat_value cell click so it reads against any basemap).
- animated rides (ScatterplotLayer): a swarm of tiny dots taking a Markov-chain random walk over the station network, colored by their current origin's cluster. Inserted just below the station scatter so stations stay clickable. Pool size scales with the city: roughly 2× the station count, capped at 5000 (so NYC gets ~4000, Chicago ~1800, DC ~1400, Boston ~800). On by default; the Rides button in the top bar toggles visibility. See §7 for the full pipeline.
- station scatter (ScatterplotLayer): one dot per station, radius in meters so zoom drives pixel size, fill color by cluster. Focus hub and linked stations get scaled up based on their share of flow.
Picking follows the draw order: station dots win over polygons, so clicking a station
behaves the same whether or not a neighborhood is drawn underneath it. The polygon
layer disables picking entirely once spatial_mix is past roughly 0.5
(i.e. the layer is almost invisible anyway) to stop hover events from registering on
a ghost region.
7. Animated bike-ride simulation
With the static map in place it's easy to forget you're looking at a living system. By default the map drops a swarm of tiny dots onto the network — each one taking a random walk over the station graph in real time — and the Rides button in the top bar toggles them off if you want a static view. The whole simulation runs client-side (the kernel is never involved past the initial data push) and reacts instantly to focus changes and the spatial↔UMAP morph.
Data shipped from Python
Two new traitlets carry everything the JS sampler needs:
transition_topk = traitlets.Dict({}).tag(sync=True)
station_outflow = traitlets.Dict({}).tag(sync=True)
- transition_topk: a sparse top-K (K=50) view of the destination-probability matrix. Format {origin_name: [[dest_name, weight], …]}. Built by viz.compute_transition_topk by walking each column of the column-normalized matrix (columns are origins) and keeping the K largest destinations with weight ≥ 1e-4. For NYC's ~2,000 stations × 50 neighbors that's well under 500 KB of JSON.
- station_outflow: raw per-station trip counts straight from the trips DataFrame, computed by viz.compute_station_outflow(trips). Optional: only populated when the user passed get_bike_data(…, return_trips=True) through to make_flow_widget(…, trips=trips). Used to weight initial bike placement by real volume; when missing, the JS side falls back to an approximation (see below).
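A rough sketch of the top-K truncation, assuming the matrix has already been unpacked into a per-origin dict of destination weights. topk_transitions is a hypothetical stand-in; viz.compute_transition_topk itself works on the matrix and may differ in detail.

```python
import heapq

def topk_transitions(matrix, k=50, min_weight=1e-4):
    """matrix: {origin: {dest: weight}} -> {origin: [[dest, weight], ...]}.

    Hypothetical helper; keeps the k heaviest destinations per origin
    and drops anything below min_weight.
    """
    topk = {}
    for origin, dests in matrix.items():
        kept = [(w, d) for d, w in dests.items() if w >= min_weight]
        largest = heapq.nlargest(k, kept)      # heaviest first
        topk[origin] = [[d, w] for w, d in largest]
    return topk

matrix = {"A": {"B": 0.6, "C": 0.39995, "D": 0.00005}}
payload = topk_transitions(matrix, k=2)
```

The output matches the {origin_name: [[dest_name, weight], …]} layout, so it serializes to JSON as-is.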
The sampler
makeRideSampler(transitionTopk, stationOutflow) builds three things up
front, all at sampler-creation time so each frame is just a few CDF lookups:
- A per-origin cumulative distribution over its top-K destinations, stored as parallel dests[] / cum[] arrays so a single Math.random() * total + linear scan picks the next station in O(K).
- A volume distribution w over origins, used both to seed the initial pool and to drive teleports (next bullet). When station_outflow is present, w = raw outflow counts, the ground truth for “where rides start”. When it's absent we approximate by power-iterating the chain's stationary distribution π = P·π for 30 steps over the sparse top-K kernel; that naturally concentrates mass on busy hubs but is slightly biased by the top-K truncation.
- A teleport CDF derived from w, restricted to stations that also have an outgoing entry so a teleport never lands on a dead-end.
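The first two pieces can be sketched in Python for readability (the real sampler is JavaScript). build_cdfs, sample_next and approx_stationary are hypothetical names, and renormalizing each truncated row inside the power iteration is an assumption about how the top-K kernel is handled.

```python
import random
from itertools import accumulate

def build_cdfs(transition_topk):
    """Precompute parallel (dests, cum) arrays for every origin with outgoing weight."""
    cdfs = {}
    for origin, pairs in transition_topk.items():
        if pairs:
            dests = [dest for dest, _ in pairs]
            cum = list(accumulate(weight for _, weight in pairs))
            cdfs[origin] = (dests, cum)
    return cdfs

def sample_next(cdfs, origin, rng=random):
    """Pick the next station with one uniform draw and a linear scan, O(K)."""
    dests, cum = cdfs[origin]
    r = rng.random() * cum[-1]   # scale by the total: truncated rows need not sum to 1
    for dest, c in zip(dests, cum):
        if r < c:
            return dest
    return dests[-1]             # guard against floating-point round-off

def approx_stationary(transition_topk, steps=30):
    """Power-iterate pi <- P.pi over the sparse kernel, renormalizing each step."""
    names = list(transition_topk)
    pi = {name: 1.0 / len(names) for name in names}
    for _ in range(steps):
        nxt = dict.fromkeys(names, 0.0)
        for origin, pairs in transition_topk.items():
            total = sum(weight for _, weight in pairs)
            for dest, weight in pairs:
                if dest in nxt and total > 0:
                    nxt[dest] += pi[origin] * weight / total
        norm = sum(nxt.values()) or 1.0
        pi = {name: value / norm for name, value in nxt.items()}
    return pi

topk = {
    "A": [["B", 0.5], ["C", 0.5]],
    "B": [["A", 1.0]],
    "C": [["A", 0.5], ["B", 0.5]],
}
cdfs = build_cdfs(topk)
pi = approx_stationary(topk)
```

On this toy chain the iteration settles on A as the busiest station, which is the "mass concentrates on hubs" behavior described above.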
Each Markov step rolls a uniform first: with probability
RIDE_TELEPORT_PROB = 0.05 the rider ignores the local distribution and
teleports to a station drawn from the volume distribution. This is a PageRank-style
move, and it captures the truck-and-van rebalancing that bike-share operators do
every night — without it, walkers can spend many steps shuffling between a
handful of nearby outer-borough stations because the column-normalized
transition_prob has lost absolute trip volume.
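One Markov step with the teleport roll might look like the following Python sketch. markov_step is a hypothetical name; the teleport target list is rebuilt on every call for brevity (the sampler precomputes it), and sending a rider stuck on a dead-end station through the teleport path is an assumption.

```python
import random

RIDE_TELEPORT_PROB = 0.05  # value quoted in the text

def weighted_choice(names, weights, rng):
    # choices() draws one item with probability proportional to its weight
    return rng.choices(names, weights=weights, k=1)[0]

def markov_step(current, transition_topk, volume, rng=random,
                teleport_prob=RIDE_TELEPORT_PROB):
    """Return the next station name for a rider currently at `current`."""
    # Teleport targets: stations with volume AND at least one outgoing entry.
    targets = [s for s in volume if transition_topk.get(s)]
    # Assumption: a rider on a dead-end station also teleports.
    if rng.random() < teleport_prob or not transition_topk.get(current):
        return weighted_choice(targets, [volume[s] for s in targets], rng)
    dests, weights = zip(*transition_topk[current])
    return weighted_choice(dests, weights, rng)

topk = {"A": [["B", 1.0]], "B": [["A", 1.0]], "C": []}
volume = {"A": 10, "B": 5, "C": 100}
rng = random.Random(0)
next_station = markov_step("A", topk, volume, rng, teleport_prob=0.0)
```

Station C has the largest volume but no outgoing entry, so it is never a teleport target.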
The pool and the loop
The simulation is a ridesPool of ride objects, each:
{
  from_name, to_name,   // current segment
  t,                    // progress in [0, 1]
  duration,             // ms; scales with great-circle distance + jitter
  color,                // RGB tinted by from-cluster
  position,             // [lng, lat], lerped each frame
  focused?,             // see "Focused modes" below
}
A dedicated requestAnimationFrame loop (ridesTick) advances
every ride by the inter-frame Δt and recomputes its
position by linear interpolation in the current
posLookup. That last detail is what makes rides ride along with the
spatial↔UMAP morph — the same dot smoothly slides from its
geographic-space lerp to its UMAP-space lerp as the slider moves, without any
knowledge of either coordinate system in the ride object itself.
When t ≥ 1 the ride takes a Markov step: from_name
becomes the old to_name, sampleNext() picks a new
destination, and duration is recomputed from the new geography.
Overshoot (t - 1) is carried into the new segment so steady-state
isn't biased by snap-to-zero on each hop. Color re-tints to the new origin's
cluster so dots visibly pick up local color as they cross neighborhood boundaries.
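The per-ride update, including the overshoot carry, can be modeled as below. This is a Python sketch of the logic rather than the JS loop; advance_ride and its callback parameters are hypothetical, and clamping the carried overshoot to 1 is an added guard for very long frames.

```python
def lerp(a, b, t):
    """Linear interpolation between two [lng, lat] points."""
    return [a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t]

def advance_ride(ride, dt_ms, pos_lookup, sample_next, duration_for):
    """Advance one ride in place by dt_ms milliseconds."""
    ride["t"] += dt_ms / ride["duration"]
    if ride["t"] >= 1.0:
        overshoot = ride["t"] - 1.0                 # carried into the new segment
        ride["from_name"] = ride["to_name"]
        ride["to_name"] = sample_next(ride["from_name"])
        ride["duration"] = duration_for(ride["from_name"], ride["to_name"])
        ride["t"] = min(overshoot, 1.0)             # guard for very long frames
    # Position is always derived from the current posLookup, so a morph
    # between coordinate systems moves the dot with no extra state.
    ride["position"] = lerp(pos_lookup[ride["from_name"]],
                            pos_lookup[ride["to_name"]], ride["t"])
    return ride

pos = {"A": [0.0, 0.0], "B": [10.0, 0.0], "C": [10.0, 10.0]}
ride = {"from_name": "A", "to_name": "B", "t": 0.0,
        "duration": 1000.0, "position": [0.0, 0.0]}
advance_ride(ride, 500, pos, lambda s: "C", lambda a, b: 2000.0)
halfway = list(ride["position"])
advance_ride(ride, 750, pos, lambda s: "C", lambda a, b: 2000.0)
```

After the second call the ride has hopped onto the B → C segment and starts a quarter of the way along it instead of at zero.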
Decoupling from the main render pipeline
The first version of this hooked the rides into scheduleRender(), and
immediately killed hover/click responsiveness: a 60 Hz rAF was forcing the whole
layer stack (basemap + polygons + flow lines + stations) to rebuild every frame,
starving picking events. The fix splits the pipeline:
- buildLayers() still produces the four non-rides layers, and caches them in cachedNonRidesLayers. It also refreshes a small ridesCtx object (posLookup, palette, cluster lookup, spatial mix, focus, focusCtx).
- ridesTick never re-runs computeStateFromInputs() or rebuilds non-rides layers. Each frame it advances the pool, builds only the rides ScatterplotLayer, splices it into cachedNonRidesLayers just below the station scatter, and pushes the result via deck.setProps({ layers }). Hover and click handlers run against stable layer instances, so picking stays smooth even at 60 fps.
The ScatterplotLayer uses an integer ridesFrame counter as
its updateTriggers.getPosition, so deck.gl re-runs the position
accessor every frame while leaving the rest of the layer's GPU state alone.
Focused modes
Any selection in the UI — a station click on the map, a row/col label or
dendrogram pick on the clustergram, or a single matrix cell — switches the
rides simulator out of the ambient Markov walk into a one-shot mode tailored to
that selection. computeStateFromInputs tags each selection with a
kind ('station', 'mat_cell',
'col_dendro', 'row_dendro', 'cat_value') and
buildFocusedRideContext(...) dispatches on it:
- station: in/out CDFs over the focused station's top neighbours, drawn straight from store.edges (so the rides match exactly what the line layer is showing). Direction is rolled per ride against total in vs out mass, so a station with mostly inbound trips visually gets mostly inbound animations.
- mat_cell: a single fixed (origin → destination) pair. Every ride is the same trip; useful for “watch how often this specific flow fires” once you've clicked a cell in the clustergram heatmap.
- col_dendro / cat_value: outgoing rides from any station in the dendrogram or category selection. Per ride: pick an origin from the selection (weighted by station_outflow when shipped, else uniform), then sample its destination from transition_topk.
- row_dendro: incoming rides to any station in the selection. We invert transition_topk over the selected destinations once at context-construction time to build a per-destination incoming CDF, then per ride: pick a destination (weighted by total inflow into the selection), then pick an origin from that destination's incoming CDF.
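The row_dendro inversion can be illustrated with a short Python sketch. build_incoming_cdfs is a hypothetical name, and treating "total inflow" as the sum of top-K weights arriving at a destination is an assumption, since those weights are conditional probabilities rather than trip counts.

```python
from collections import defaultdict
from itertools import accumulate

def build_incoming_cdfs(transition_topk, selected):
    """Invert {origin: [[dest, w], ...]} into per-destination incoming CDFs."""
    selected = set(selected)
    incoming = defaultdict(list)
    for origin, pairs in transition_topk.items():
        for dest, weight in pairs:
            if dest in selected:
                incoming[dest].append((origin, weight))
    cdfs = {}
    for dest, pairs in incoming.items():
        cum = list(accumulate(weight for _, weight in pairs))
        cdfs[dest] = {
            "origins": [origin for origin, _ in pairs],
            "cum": cum,
            "total": cum[-1],   # assumed proxy for inflow into this destination
        }
    return cdfs

topk = {
    "A": [["X", 0.5], ["Y", 0.5]],
    "B": [["X", 1.0]],
    "C": [["Z", 1.0]],
}
incoming = build_incoming_cdfs(topk, ["X", "Y"])
```

Destinations outside the selection (Z here) never enter the result, so the one-off inversion only costs a pass over the top-K entries.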
Each focused ride is tagged focused: true; advanceRides
checks the flag and retires the ride on segment completion
(pool[i] = null) instead of taking another Markov step.
refillRides immediately spawns a replacement — the pool stays
full but chain continuation is disabled in every focused mode.
Pool size is computed by
targetRidesPoolSize(nRides, kind, focusCtx, totalStations) with
three regimes. Narrow selections (single station, matrix cell)
return max(50, round(nRides × 0.10)) — a one-station
view doesn't need thousands of walkers and the dense swarm would obscure the
in/out lines. Dendrogram & category selections scale linearly
by selection_size / total_stations, floored at 50: per-station
ride density stays constant whether you click a 30-station cluster or a
300-station one. Ambient (no focus) returns the full slider value.
The per-city slider max is about 10 × num_stations (clamped
1k–20k); the initial thumb is the linear midpoint of that range. When the
station list first arrives, if n_rides was only clamped to the
placeholder cap (1k before num_stations was known), it is reset
to that midpoint so the control doesn't stay pegged near the minimum.
With the default and ~2000 NYC stations (max 20k, default ~10k), a 200-station
cluster animates ∼1000 walkers, a 30-station cluster ∼150.
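The three regimes reduce to a small function. The sketch below is Python, with selection_size passed in directly rather than read from focusCtx, so the signature is a simplification of the real targetRidesPoolSize.

```python
NARROW_KINDS = {"station", "mat_cell"}
SET_KINDS = {"col_dendro", "row_dendro", "cat_value"}

def target_rides_pool_size(n_rides, kind, selection_size, total_stations):
    """Pool size for the current selection kind (None means ambient)."""
    if kind in NARROW_KINDS:
        # Single station or matrix cell: a tenth of the slider value, floored at 50.
        return max(50, round(n_rides * 0.10))
    if kind in SET_KINDS and total_stations > 0:
        # Dendrogram / category: scale by the selected share of stations.
        return max(50, round(n_rides * selection_size / total_stations))
    return n_rides  # ambient: the full slider value

big_cluster = target_rides_pool_size(10_000, "col_dendro", 200, 2000)
small_cluster = target_rides_pool_size(10_000, "row_dendro", 30, 2000)
```

With a 10k slider value and 2000 stations this reproduces the figures above: about 1000 walkers for a 200-station cluster and about 150 for a 30-station one.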
Dendro and category selections also opt in to a thin dark-grey stroke around
each ride (stroked: true, ~0.8 px line width) because the
cluster palette includes light yellows / mints that disappear against the
basemap when rendered as flat dots.
Switching between selections (or back to ambient) flushes the pool via
flushRidesForSelectionChange, which keys off a signature built from
the kind plus the relevant identifiers (focused station name, mat-cell pair,
dendro selection set). Same selection reissued under a new
linkInteractionSeq doesn't flicker; an actual change immediately
swaps the swarm.
Tuning knobs
- Pool size: computed per-city by targetRidesPoolSize(numStations, focused). Ambient is clamp(round(RIDES_PER_STATION × numStations), 400, 5000): NYC ~4000, Chicago ~1800, DC ~1400, Boston ~800. Focused (any station / cell / dendro / cat selection) drops to RIDES_FOCUSED_FRACTION = 0.10 of ambient, floored at RIDES_FOCUSED_MIN = 50. resizeRidesPool grows or trims the pool in place on every selection change. Higher RIDES_PER_STATION feels denser but costs CPU on the per-frame interpolation loop; the cap of 5000 keeps even NYC under a few ms per frame.
- RIDE_TELEPORT_PROB = 0.05: balance between local realism (low) and rebalancing-toward-hubs (high). 0.15 felt too jumpy; 0.0 traps walkers in outer-borough cycles.
- RIDE_MS_PER_DEG: constant velocity, expressed as milliseconds per degree of (lng, lat) distance. We use a constant-velocity model so every dot moves at the same on-screen speed regardless of hop length: a short adjacent-station hop finishes quickly, a cross-town hop takes proportionally longer, and the visual flow speed stays consistent across the whole map. RIDE_SEG_VEL_JITTER applies a small (±18%) multiplicative jitter on the velocity so dots in the same wave don't arrive in lockstep, and RIDE_SEG_MIN_MS guards against zero-length segments (co-located stations) becoming instantaneous.
- Dot radius (getRadius: 1.3, min 0.8 px / max 2.2 px): small enough that 1000 simultaneous rides don't overwhelm the station scatter.
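How the velocity knobs combine into a segment duration can be sketched as follows. The constant names come from the list above, but the values for RIDE_MS_PER_DEG and RIDE_SEG_MIN_MS are placeholders (the notes do not quote them), and plain Euclidean distance in degrees is an assumption.

```python
import math
import random

RIDE_MS_PER_DEG = 60_000.0    # placeholder value, not quoted in the notes
RIDE_SEG_VEL_JITTER = 0.18    # ±18 % multiplicative jitter, as stated
RIDE_SEG_MIN_MS = 250.0       # placeholder floor

def segment_duration_ms(a, b, rng=random):
    """Duration of the hop a -> b, where a and b are [lng, lat] pairs."""
    dist_deg = math.hypot(b[0] - a[0], b[1] - a[1])
    jitter = 1.0 + rng.uniform(-RIDE_SEG_VEL_JITTER, RIDE_SEG_VEL_JITTER)
    # The jitter scales velocity, so the duration is divided by it.
    duration = dist_deg * RIDE_MS_PER_DEG / jitter
    return max(RIDE_SEG_MIN_MS, duration)

co_located = segment_duration_ms([0.0, 0.0], [0.0, 0.0])
```

Co-located stations fall back to the floor, and a hop twice as long takes roughly twice as long, which is what keeps the on-screen speed uniform.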
8. Alpha-shape neighborhoods
The polygons come from celldega.nbhd.alpha_shape, which is a thin
wrapper over libpysal's alpha-shape implementation. Input is a
point cloud; output is a MultiPolygon that approximates the “tight
concave hull” of those points, parameterized by an inverse radius α (larger α,
i.e. a smaller radius, = more filigree; smaller α = closer to the convex hull).
Projection
libpysal uses Euclidean distance on the input coordinates, so if you pass raw
(lng, lat) you're implicitly working in degrees, which is useless for “radius
in miles”. nbhd.compute_cluster_alpha_shapes runs each city's
stations through a local equirectangular projection centered on the mean station,
converting to meters, runs the alpha shape with α⁻¹ = r × 1609.34 (r in miles,
so α⁻¹ is in meters), and uses a rounded-meter lookup table to map output vertices back
to both their original (lng, lat) and their UMAP coords.
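A minimal sketch of the projection and the rounded-meter lookup, assuming a spherical Earth of radius 6,371 km. All function names here are hypothetical, and the alpha-shape call itself is left out.

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # spherical-Earth assumption
METERS_PER_MILE = 1609.34

def make_forward_projection(points):
    """points: list of (lng, lat) in degrees. Returns forward(lng, lat) -> (x, y) in meters."""
    lng0 = sum(p[0] for p in points) / len(points)   # mean station
    lat0 = sum(p[1] for p in points) / len(points)
    ky = math.radians(1.0) * EARTH_RADIUS_M          # meters per degree of latitude
    kx = ky * math.cos(math.radians(lat0))           # longitude shrinks with latitude

    def forward(lng, lat):
        return ((lng - lng0) * kx, (lat - lat0) * ky)

    return forward

def build_vertex_lookup(points, forward):
    """Map rounded projected meters back to the original (lng, lat)."""
    lookup = {}
    for lng, lat in points:
        x, y = forward(lng, lat)
        lookup[(round(x), round(y))] = (lng, lat)
    return lookup

def alpha_for_radius_miles(r_miles):
    """Inverse radius in 1/meters for a radius given in miles."""
    return 1.0 / (r_miles * METERS_PER_MILE)

stations = [(-74.0, 40.7), (-73.9, 40.8)]
forward = make_forward_projection(stations)
x, y = forward(-74.0, 40.7)
lookup = build_vertex_lookup(stations, forward)
```

Because alpha-shape vertices are always input points, rounding to whole meters is enough to recover which station a vertex came from, as long as no two stations share the same meter cell.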
Precomputation
Ten log-spaced radii from 0.05 mi (block scale) to 5 mi (cross-borough) are precomputed per cluster at notebook-build time. The payload is a single dict:
{
  "levels_miles": [0.05, 0.084, 0.140, …, 5.0],
  "polygons": [
    {"cluster_id": 1, "by_level": [
      [{"geo": [[lng,lat],…], "umap": [[lng,lat],…]}, …],  # level 0
      …
    ]},
    …
  ]
}
Coords are rounded to 5 decimals (≈1 m precision). For a typical city (150 clusters, avg 15-vertex polygons, 10 levels) that's well under 200 KB of JSON, which embeds comfortably into the standalone widget HTMLs. A binary-buffer encoding via anywidget is a drop-in replacement if we ever need to go further.
Flicker fix: layer id per resolution
The NBHD slider is a pure front-end control — it just changes
store.alpha_index. Originally the PolygonLayer had a
constant id and relied on updateTriggers to swap the
getPolygon accessor. Unfortunately that tells deck.gl
“the same layer, new geometry” — which makes it try to interpolate
attribute transitions across vertex buffers of wildly different sizes, causing a
flicker at every slider step. Fix: include alphaIdx in the layer id.
A resolution change is now a full layer swap; the transitions we do want (hover
color/width) still run smoothly within a single resolution.
9. Spatial ↔ UMAP morph
Each station ships with both a (lng, lat) pair and a
(umap_lng, umap_lat) pair (centered on the geographic centroid so the
two frames overlay). A single spatial_mix float in [0, 1] drives
everything:
// in buildLayers()
const t = spatialMix;
for (const s of stations) {
  const lng = Number(s.lng) * (1 - t) + (Number(s.umap_lng) + dLng) * t;
  const lat = Number(s.lat) * (1 - t) + (Number(s.umap_lat) + dLat) * t;
  posLookup[s.name] = [lng, lat];
}
Flow lines read the same posLookup so endpoints follow. Alpha-shape
polygons morph vertex-by-vertex using the geo↔UMAP lookup we built at
precomputation time — no alpha shape is ever computed in UMAP space, we just
drag the existing vertices along with their stations. The polygon layer's alpha
fades to 0 by spatial_mix = 1 because a geo alpha shape loses meaning
once it's warped.
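The vertex-by-vertex polygon morph follows the same lerp. Below is a small Python model that uses the {"geo", "umap"} ring layout from the precomputed payload; morph_ring and polygon_alpha are hypothetical names, the linear fade is an assumption about the easing, and any UMAP offset is assumed to be baked into the coordinates.

```python
def morph_ring(ring, t):
    """Blend a ring between its geographic (t = 0) and UMAP (t = 1) vertices.

    ring: {"geo": [[lng, lat], ...], "umap": [[x, y], ...]} with matching vertex order.
    """
    return [
        [g[0] * (1 - t) + u[0] * t, g[1] * (1 - t) + u[1] * t]
        for g, u in zip(ring["geo"], ring["umap"])
    ]

def polygon_alpha(base_alpha, t):
    # Linear fade to 0 at t = 1; the real easing curve may differ.
    return round(base_alpha * (1 - t))

ring = {"geo": [[0.0, 0.0], [2.0, 0.0]], "umap": [[1.0, 1.0], [2.0, 4.0]]}
midway = morph_ring(ring, 0.5)
```

No alpha shape is recomputed at any t; the existing vertices are just dragged between their two stored positions.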
The basemap likewise fades out, and the map area's CSS background lerps from a tile-matching grey to pure white so the UMAP view ends up on a clean canvas.
10. Building the static HTMLs
Two notebook-driven pipelines produce the artifacts shipped alongside
index.html:
- build_htmls.ipynb runs each city notebook via nbconvert.ExecutePreprocessor, then exports to a full <City>.html (code cells + outputs).
- build_widget_htmls.ipynb generates a minimal per-city widget shell by programmatically building a single-code-cell notebook that invokes make_flow_widget + link_flow_to_clustergram, runs it, then exports via HTMLExporter with exclude_input. The result is a self-contained <City>_widget.html with just the HBox of the flow map + clustergram; that's what the iframe on the landing page loads.
We initially tried ipywidgets.embed_minimal_html and hit an incompatibility with anywidget's ES-module bundling (embed-amd.js couldn't locate anywidget.js on disk). The nbconvert route bundles anywidget's JS as a data URI into the notebook's output HTML, so there's no runtime fetch and the page works from file:// URLs.
11. Known limits & future work
- No Clustergram hover linkage. Celldega's Clustergram keeps hover state (hovered_cat) purely in its own JS store; it never pushes to a Python traitlet. Only clicks make it across. Adding a hover traitlet upstream is an obvious improvement.
- Alpha shapes are computed per cluster, not jointly. Overlapping clusters produce overlapping polygons. A constrained Voronoi or nested containment scheme would clean that up.
- UMAP is recomputed at build time. The embedding's random initialization means rebuilds can subtly reshuffle the UMAP layout; we'd want a fixed seed or a saved embedding to stabilize publication-quality outputs.
- Flow lines cap at top-30. The map asks the Clustergram for a top-30 slice on station click (MAP_STATION_SLICE_TOP_K) and lets the simulated rides carry the long-tail signal visually; bumping the cap higher quickly turns Manhattan hubs into spaghetti. An alternate ArcLayer rendering for arched cross-town flows, or a “show all” mode that fades sub-pixel edges by probability, would let us push the cap up without sacrificing readability.
- Widget-only HTML is ~1.5 MB per city, dominated by the deck.gl + anywidget bundles. Moving to a shared CDN import, or lazy-loading the map/cgm bundles, would trim it significantly.