NVIDIA announced its next-generation AI platform, Vera Rubin, at GTC 2026, alongside an integration with Groq's LPU (Language Processing Unit) that splits the transformer inference pipeline between GPU and dedicated decode hardware. The platform, built on TSMC's N3P (3nm) process, packs 336 billion transistors and uses HBM4 memory. It's aimed at organizations deploying large language models (LLMs) at scale, in particular those facing the memory-bandwidth bottleneck during the decode phase of token generation.
What's striking here is not just the raw specs, but the admission that a single GPU architecture isn't optimal for every part of the inference process. Prefill — processing the input prompt — is compute-bound, while decode — generating each new token — is memory-bandwidth-bound. NVIDIA says offloading decode to Groq's SRAM-based LPU can reduce total inference cost by an order of magnitude.
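The compute-bound versus bandwidth-bound distinction can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not a model of any real chip: the accelerator figures (1000 TFLOP/s, 4 TB/s), the 2048-token prompt, and the function names are all hypothetical, and attention and KV-cache traffic are ignored.

```python
# Rough roofline-style estimate. All hardware numbers are hypothetical
# placeholders, not Vera Rubin or LPU specifications.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes a dense model where each parameter costs about 2 FLOPs
    (one multiply, one add) per token and is read from memory once per pass.
    Attention and KV-cache traffic are ignored.
    """
    return (2.0 * tokens_per_pass) / bytes_per_param


def limiting_resource(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the chip's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 1000 TFLOP/s of compute, 4 TB/s of memory bandwidth.
PEAK_FLOPS = 1000e12
MEM_BANDWIDTH = 4e12

prefill = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte, "
      f"{limiting_resource(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.0f} FLOPs/byte, "
      f"{limiting_resource(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions, prefill reuses each weight across thousands of prompt tokens, while single-stream decode reads every weight to produce one token. Batching many requests raises the decode figure, so the gap is widest for small-batch, latency-sensitive serving.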
What is Vera Rubin?
Vera Rubin is NVIDIA's new AI compute platform, replacing Blackwell. Key hardware specs include:
- Built on TSMC N3P process (3nm)
- 336 billion transistors
- HBM4 memory
- Shipping in Q3 2026
NVIDIA claims Vera Rubin reduces token generation costs by 10x compared to its predecessor, Blackwell, and cuts GPU requirements for training Mixture-of-Experts models by 4x. The company also unveiled five new MGX-series racks for large-scale POD deployments, though full specs have not been released.
Early customers include Meta, OpenAI, and Anthropic, all of which are set to receive systems in early Q3 2026. Major cloud providers — AWS, Google Cloud, Azure, and Oracle Cloud — plan to deploy Vera Rubin instances in the second half of 2026. Exact pricing for the chips or racks was not disclosed.
How Groq's LPU Fits In
Groq's third-generation LPU uses SRAM rather than the HBM found in conventional GPUs. SRAM is faster but smaller in capacity, so the LPU trades memory capacity for very high memory bandwidth per die, which suits bandwidth-bound workloads. In a prefill-decode split, the LPU handles the decode phase, in which output tokens are generated one at a time. NVIDIA and Groq offer a new rack configuration, the LPX, which pairs 256 LPUs with a Vera Rubin NVL72. NVIDIA recommends that data centers aim for roughly 25% LPU capacity relative to overall compute for optimal inference efficiency.
Groq has previously demonstrated 241 tokens/second on Llama 2 70B, more than double what other providers achieved at the time. The company claims the LPU delivers about 10x throughput for LLM inference at 90% lower power consumption per compute operation compared to standard GPUs. Those power savings are measured at the chip level, not across the full system. The LPU also provides "an order of magnitude more memory bandwidth per die than a Rubin GPU," according to the firms.
“By offloading the decode phase to Groq's LPU, we address the fundamental memory-bandwidth bottleneck in LLM inference,” said an NVIDIA spokesperson.
Comparison to Alternatives
AMD's upcoming Instinct MI400 and Intel's Gaudi 3 both aim at inference but lack a dedicated hardware decode accelerator. Google's TPU v6 (Trillium) handles decode via software kernel separation but uses HBM, not SRAM. Cerebras's wafer-scale engine offers a large SRAM pool (44 GB per chip) for decode but lacks a prefill counterpart and uses a proprietary system architecture that doesn't plug into NVIDIA racks. SambaNova's SN40L uses a reconfigurable dataflow architecture but is DRAM-based and lacks NVIDIA's software ecosystem. The Vera Rubin LPU pairing creates a unified heterogeneous inference rack with a single software stack (CUDA plus Groq drivers) — something no competitor currently offers.
The key differentiator is that the prefill-decode hardware split is new for commercially available systems. No other major GPU vendor has adopted this approach yet.
What's Still Unknown
- Exact pricing for Vera Rubin chips or LPX racks — important for calculating total cost of ownership.
- Actual benchmark results on specific models (claims are based on NVIDIA's own testing).
- Power consumption figures for the full Vera Rubin platform, not just the LPU die.
- Full specifications for the MGX-series racks.
- Details on how LPU integration affects overall system latency — especially the overhead of data transfer between GPU and LPU across PCIe or custom interconnects.
- Availability of LPX racks beyond early customers (Meta, OpenAI, Anthropic).
Analysis
The Vera Rubin LPU integration is an elegant but risky bet. The prefill-decode split acknowledges that GPUs are overprovisioned for parts of inference, notably the bandwidth-bound decode phase, a point critics of NVIDIA's monolithic approach have made for years. If the software overhead (driver scheduling, inter-chip latency) stays low, the 10x cost reduction is plausible. But if kernel-level scheduling across two memory systems adds even a few milliseconds per token, the benefits could shrink significantly.
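To make the latency sensitivity concrete, here is a back-of-the-envelope sketch. It assumes the GPU-to-LPU overhead is paid serially on every token and is not hidden by pipelining or batching. It borrows the 241 tokens/second figure quoted earlier purely as an illustrative single-stream baseline. The function name and the overhead values are hypothetical.

```python
def throughput_with_overhead(base_tokens_per_s: float, overhead_ms: float) -> float:
    """Single-stream tokens/second after adding a fixed per-token delay.

    Assumes the overhead is fully serialized with token generation,
    which is a pessimistic simplification.
    """
    if base_tokens_per_s <= 0 or overhead_ms < 0:
        raise ValueError("baseline must be positive and overhead non-negative")
    base_ms_per_token = 1000.0 / base_tokens_per_s
    return 1000.0 / (base_ms_per_token + overhead_ms)


BASELINE = 241.0  # tokens/second, used only as an illustrative starting point

for overhead_ms in (0.0, 1.0, 2.0, 5.0):  # hypothetical per-token overheads
    rate = throughput_with_overhead(BASELINE, overhead_ms)
    print(f"{overhead_ms:.0f} ms overhead -> {rate:.0f} tokens/s "
          f"({rate / BASELINE:.0%} of baseline)")
```

At this baseline a token takes a little over 4 ms, so a serialized overhead of the same size would roughly halve single-stream throughput. If transfers can be overlapped with compute, the real penalty would be smaller than this sketch suggests.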
The bigger strategic question is about Groq's future. By integrating its LPU into NVIDIA's rack, Groq cedes its independence and becomes a component supplier. It's a reasonable outcome for a company that struggled to build its own server ecosystem, but it also makes Groq vulnerable: NVIDIA could develop its own SRAM-based decode unit in a future generation, cutting out Groq. The partnership feels like a prelude to acquisition.
Hyperscalers will also push back. AWS, Google, Azure, and Oracle all have internal inference accelerators. The 25% LPU capacity guidance NVIDIA offers is self-serving — it locks customers into a heterogeneous system that only NVIDIA provides end-to-end. If Google decides to deploy its own SRAM-based decoder instead of buying LPX racks, NVIDIA's market share could erode, especially in the fast-growing inference segment.
Finally, the power consumption claims need scrutiny. A 90% reduction in power per compute operation sounds impressive, but at the full system level, which includes the GPU's power draw, interconnects, and cooling, the savings may be less dramatic. Real TCO will depend on real workloads, not die-level marketing numbers. Without independent benchmarks on production models (GPT-4 scale, multi-modal, long-context), the 10x improvement remains an aspiration, not a certainty.
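The gap between die-level and system-level savings can be sketched with simple arithmetic. The model below assumes the 90% figure applies only to the share of system power that the offloaded compute accounts for, with everything else (GPU prefill, interconnects, cooling) unchanged. The shares used are made-up inputs, not measured values, and the function name is hypothetical.

```python
def system_power_saving(offloaded_share: float, chip_saving: float = 0.90) -> float:
    """Fraction of total system power saved when only part of it shrinks.

    offloaded_share: fraction of system power drawn by the work that moves
                     to the more efficient chip (hypothetical input).
    chip_saving:     die-level reduction for that work (0.90 = 90% lower).
    """
    if not 0.0 <= offloaded_share <= 1.0 or not 0.0 <= chip_saving <= 1.0:
        raise ValueError("both arguments must be between 0 and 1")
    return offloaded_share * chip_saving


for share in (0.2, 0.4, 0.6):  # hypothetical shares of system power
    saving = system_power_saving(share)
    print(f"offloaded share {share:.0%} -> system-level saving {saving:.0%}")
```

Under this simplification, a 90% die-level saving only approaches 90% at the system level if nearly all system power comes from the offloaded work, which is why full-platform power figures matter for TCO.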