🔬 Temporal State Prediction

A frozen V-JEPA-2 ViT-L encoder with a small attentive-pool head predicts per-cell cell-cycle state straight from a tracked clip, with no separate segment, track, and classify steps.
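As a rough illustration of what a head-only setup can look like, the sketch below pools a set of encoder tokens with one learned query and feeds the result to a linear classifier. It is a simplified single-query, single-head version written in plain Python for readability; the function names, shapes, and weights are assumptions for illustration, not the released head.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(tokens, query):
    """Pool N token vectors into one vector using a single query vector.

    tokens: list of N vectors (each a list of D floats) from the frozen encoder.
    query:  one vector of D floats; during training this would be learnable.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    # Scaled dot-product score between the query and every token.
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # Weighted sum of tokens: the tokens the query attends to dominate.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

def classify(pooled, class_weights, class_names):
    """Linear layer on the pooled vector, returning the arg-max class name."""
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
    return class_names[logits.index(max(logits))]
```

Only `query` and `class_weights` would be trained in this sketch; the encoder that produces `tokens` stays frozen, which is what keeps the trainable part small.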

Finding: a frozen video foundation model holds its own against a purpose-built morphology and temporal baseline, and the binding constraint is data scale, not model capacity.

Select a single-cell clip below. The model classifies that one clip into its cell-cycle state.

Predicted: interphase

Actual: interphase

✅ correct

Held-out HeLa (sequence 02, n=5312)

| model | macro-F1 | mitosis F1 | mitosis event P/R (±3 frames) |
|---|---|---|---|
| U-Net+BiLSTM baseline (3.8M) | 0.629 | 0.599 | 0.65 / 0.62 |
| frozen V-JEPA-2 head-only | 0.6776 | 0.681 | 0.69 / 0.74 |
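The event-level precision and recall column is scored with a ±3-frame tolerance. One way to compute such a score is sketched below, assuming events are single frame indices and that each true event can be matched by at most one prediction. The greedy nearest-event matching and the function names are assumptions; the page does not specify the exact matching rule.

```python
def match_events(true_frames, pred_frames, tolerance=3):
    """Greedy one-to-one matching of predicted to true event frames.

    A prediction counts as a true positive if it falls within `tolerance`
    frames of a true event that has not been matched yet.
    """
    unmatched = sorted(true_frames)
    true_positives = 0
    for p in sorted(pred_frames):
        # Nearest still-unmatched true event to this prediction, if any.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)
            true_positives += 1
    return true_positives

def event_precision_recall(true_frames, pred_frames, tolerance=3):
    """Event-level precision and recall under a frame tolerance."""
    tp = match_events(true_frames, pred_frames, tolerance)
    precision = tp / len(pred_frames) if pred_frames else 0.0
    recall = tp / len(true_frames) if true_frames else 0.0
    return precision, recall
```

Greedy matching is simple but not guaranteed optimal when events are densely packed; an assignment-based matcher would be the stricter alternative.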

Data scaling (GOWT1 to HeLa) lifts the same baseline by +0.186 macro-F1; a roughly 80x larger model adds only +0.046. Seed band: 0.635 ± 0.098 (3 seeds). Single-seed gaps below 0.08 are not significant.
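The comparison above can be written as a small check. The sketch assumes both quoted gains are single-seed figures and treats the stated 0.08 floor as a simple threshold; `seed_band` assumes the band is a mean with a sample standard deviation, which the page does not confirm.

```python
import statistics

DATA_SCALING_GAIN = 0.186    # macro-F1 gain from GOWT1 to HeLa, as quoted
MODEL_SCALING_GAIN = 0.046   # macro-F1 gain from the larger model, as quoted
SINGLE_SEED_FLOOR = 0.08     # gaps below this are treated as noise

def clears_floor(gap, floor=SINGLE_SEED_FLOOR):
    """True if a single-seed gap is at least as large as the noise floor."""
    return abs(gap) >= floor

def seed_band(scores):
    """Mean and sample standard deviation across per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

for name, gap in [("data scaling", DATA_SCALING_GAIN),
                  ("model scaling", MODEL_SCALING_GAIN)]:
    verdict = "clears" if clears_floor(gap) else "does not clear"
    print(f"{name}: +{gap:.3f} {verdict} the {SINGLE_SEED_FLOOR} floor")
```

Under these assumptions the data-scaling gain clears the floor and the model-scaling gain does not, which is the basis for reading data scale as the binding constraint.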

Confusion matrix (rows = true, cols = pred):

| true ⧵ pred | interphase | pre-mitosis | mitosis | recall |
|---|---|---|---|---|
| interphase | 4550 | 211 | 52 | 0.95 |
| pre-mitosis | 173 | 136 | 14 | 0.42 |
| mitosis | 44 | 7 | 125 | 0.71 |
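The per-class recall and the macro-F1 can be recomputed directly from the confusion matrix above. The sketch below uses only the counts shown on this page; the helper names are illustrative.

```python
LABELS = ["interphase", "pre-mitosis", "mitosis"]

# Rows are true classes, columns are predictions, copied from the matrix above.
CONFUSION = [
    [4550, 211, 52],
    [173, 136, 14],
    [44, 7, 125],
]

def per_class_metrics(matrix):
    """Precision, recall, and F1 for each class of a square confusion matrix."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                          # row sum: true count
        predicted = sum(matrix[r][k] for r in range(n))   # column sum
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(matrix)
    return sum(m["f1"] for m in scores) / len(scores)

for label, m in zip(LABELS, per_class_metrics(CONFUSION)):
    print(f"{label}: recall={m['recall']:.2f} f1={m['f1']:.3f}")
print(f"macro-F1={macro_f1(CONFUSION):.4f}")
```

The counts sum to 5312 clips, and the recomputed macro-F1 (0.6776) and mitosis F1 (0.681) match the frozen V-JEPA-2 head-only row, so the matrix describes that model.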

Dominant error: pre-mitosis read as interphase. It is a soft, lineage-defined 8-frame window with no sharp morphological boundary.
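To make the labelling issue concrete, here is a minimal sketch of how a lineage-defined window could assign per-frame labels to one track. It assumes a track that ends in a division has its final frames labelled mitosis and the 8 frames before those labelled pre-mitosis; `mitosis_len` is a hypothetical parameter, since the page does not state how mitosis frames are delimited.

```python
def label_track(n_frames, ends_in_division, mitosis_len, pre_window=8):
    """Assign a cell-cycle label to each frame of one tracked cell.

    n_frames:         number of frames in the track.
    ends_in_division: True if the lineage tree shows the track splitting.
    mitosis_len:      hypothetical count of final frames labelled mitosis.
    pre_window:       frames before mitosis that are labelled pre-mitosis.
    """
    labels = ["interphase"] * n_frames
    if not ends_in_division:
        return labels
    mitosis_start = max(0, n_frames - mitosis_len)
    pre_start = max(0, mitosis_start - pre_window)
    for i in range(pre_start, mitosis_start):
        labels[i] = "pre-mitosis"
    for i in range(mitosis_start, n_frames):
        labels[i] = "mitosis"
    return labels
```

In this scheme the last interphase frame and the first pre-mitosis frame are adjacent in time and may look nearly identical, so the label flips on a frame count rather than on a visible change, which is consistent with the confusion between these two classes.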


Models: DnaRnaProteins/vjepa2-cell-cycle-vit-l, DnaRnaProteins/unet-bilstm-cell-cycle-baseline · Data: MICCAI Cell Tracking Challenge (Fluo-N2DL-HeLa). Labels derived from lineage trees (no manual annotation).