SHERIDAN, WYOMING -- June 25, 2026 -- General Motors has published details of EMWU, an offboard data pipeline designed to extract rare, safety-critical driving scenarios from millions of miles of autonomous vehicle fleet footage without proportional increases in compute cost. GM's autonomous vehicle fleet generates continuous sensor and camera streams across hundreds of test vehicles, and the volume of that data has made brute-force AI analysis economically and technically unworkable. EMWU separates the problem into three distinct tiers — domain projection, candidate retrieval, and deep visual reasoning — each calibrated to spend compute only where it is warranted.
GM's Fleet Data Volume Creates a Bottleneck That Single-Model AI Cannot Solve
Autonomous vehicle development depends on long-tail scenario mining: finding the rare near-miss collisions, unusual road obstacles, or novel construction equipment that appear only a handful of times across millions of hours of footage. GM operates hundreds of test vehicles that have produced millions of miles of high-quality recorded data. Processing that corpus with a flagship multimodal large language model applied uniformly to every clip would be prohibitively expensive and, as GM's engineers found, technically unreliable. At scale, these models exhibit what the team calls "lost in the middle" behavior — underweighting or missing subtle temporal interactions when context windows become overloaded with dense video inputs.
The compute economics are stark. Video inputs and extended reasoning outputs demand far more processing than text-only queries. Apply that cost to millions of clips and a single-tier approach collapses under its own weight.
EMWU Routes Each Clip Through the Cheapest Tier That Can Handle It
EMWU operates on a principle GM describes as doing the cheapest reusable work first and reserving expensive reasoning for the small fraction of cases that truly require it. Tier 1 runs a detection model across the full corpus, producing bounding boxes, object tags, and image embeddings for autonomous vehicle domain objects — cars, pedestrians, traffic lights — at a cost GM has engineered down to less than $1.00 per one million embeddings per day. Near-100% GPU utilization was achieved through optimized data loading, prefetching, and tuned worker parallelism. These embeddings are stored as durable artifacts, reusable across future queries without reprocessing.
Tier 2 handles candidate retrieval. When a user issues a query — for example, searching for ambulances on the road or a specific type of debris — the system encodes that query as a text embedding and runs a similarity search over the stored patch embeddings using an Inverted File Index structure. This narrows the search space from millions of clips down to a highly relevant candidate set. GM chose IVF approximate nearest-neighbor search over a fully in-memory index for cost reasons, accepting a modest precision trade-off in exchange for substantially lower storage overhead.
Only Tier 3 engages the expensive visual language model. It receives the top candidate frames, enriched with contextual data such as time window information and multi-camera views, and applies a task-tailored reasoning prompt. The VLM handles the spatial and temporal complexity that embedding search cannot reliably resolve.
Spatial and Temporal Ambiguity Require Deep Reasoning, Not Pattern Matching
GM's engineers are explicit about why deep reasoning remains necessary at the final tier. Embedding-based retrieval handles category-level similarity well — it can find clips that look like prior examples. But safety-critical events are defined by relationships, not categories. Whether two vehicles have intersecting trajectories, whether an agent is reacting to another agent rather than to environmental conditions, whether a partially occluded object poses a hazard — these are judgment calls that require causal inference, not pattern correlation. Occlusion, clutter, and incomplete evidence are standard features of real-world driving scenes. Without the VLM's reasoning capacity at Tier 3, the system would surface visually similar clips but miss the behavioral content that actually matters for safety validation.
Temporal reasoning adds another layer. Video frames are snapshots. Inferring latent states — intent, attention, acceleration trends — requires reasoning across sequences, not just across frames. Distinguishing causal attribution from spurious correlation (did driver A react to driver B, or did both respond to traffic density?) is exactly the kind of task where VLMs outperform lighter retrieval methods, but only when given the right context.
LoRA Adapters Allow New Query Types Without Rebuilding the Embedding Database
A persistent challenge in domain-adapted AI systems is the cost of re-indexing. When engineers need to retrieve a concept that was not well-separated in the original embedding space — a newly observed type of dangerous road debris, for example — traditional methods require regenerating embeddings across the full corpus. EMWU avoids this through lightweight model fine-tuning using Low-Rank Adaptation adapters applied only to the text tower of a CLIP-style model, leaving the vision encoder frozen. A linear projection layer aligns the adapted text representations with the existing visual embedding space at query time. No visual embeddings are regenerated. GM reports that this approach requires fewer than 100 labeled examples for a new long-tail scenario and eliminates expensive database backfills entirely.
GM Targets Logarithmic Cost Growth as Fleet Video Volume Expands
The engineering goal behind EMWU is not just cost reduction today — it is a cost scaling curve that does not grow linearly with fleet size. GM's team states that as video volume increases, the architecture is designed to push more work into the lowest, cheapest tiers of the cascade, keeping expensive VLM reasoning anchored to a small and consistent fraction of total clips. The domain projection artifacts generated at Tier 1 are reusable across any number of future queries. Each new mining task draws on the same pre-computed embeddings rather than triggering fresh inference runs across the full corpus.
This architecture is positioned by GM as part of the foundational infrastructure for physical AI — autonomous systems that must learn from real-world conditions safely and continuously as they encounter new scenarios, new vehicle types, and new edge cases that no simulation environment would have predicted.