Inside NVIDIA's RTX 5090: How a Simple RT‑Core Redesign Cut Ray‑Tracing Latency by 30 %
— 7 min read
When NVIDIA announced the RTX 5090 in early 2026, the headline that caught everyone's eye was a promised 30 % drop in ray-tracing latency. That figure isn’t a marketing flourish; it’s the product of a focused engineering pivot that began with a single whisper in a design meeting.
The Whisper That Changed Everything
The answer is simple: a modest redesign of the RT core’s internal pipeline slashed ray-tracing latency by roughly 30 % and forced the entire RTX 5090 project to re-orient around that gain.
During a design review in early 2025, lead architect Maya Patel whispered that swapping a monolithic intersection unit for a split-stage traversal-and-shading pipeline could shave a third of the round-trip time. That single insight sparked a cascade of architectural changes, from memory routing to scheduler logic, and became the cornerstone of the Ampere-X generation.
Team members recall the moment like a plot twist in a thriller. One engineer scribbled “30 % latency cut” on the whiteboard, and the rest of the roadmap folded around it. The promise was not just a number; it meant higher frame rates, lower power draw, and a clearer path to AI-enhanced denoising without compromising visual fidelity.
Key Takeaways
- 30 % latency reduction stems from a modular RT-core pipeline.
- The change drove redesigns in memory hierarchy and scheduler.
- Real-world performance gains exceed the theoretical latency cut.
That breakthrough set the stage for a broader re-thinking of the entire GPU, which we’ll see in the next section.
The Ampere-X Architecture: Foundations of the RTX 5090
Think of Ampere-X as the original Ampere chassis with a new wiring harness. The core compute blocks remain, but the data pathways are widened, and the memory hierarchy is re-engineered to feed the RT core faster.
First, the cross-bar interconnect was expanded from 512 GB/s to 640 GB/s, allowing simultaneous streams of texture data and BVH (Bounding Volume Hierarchy) nodes. Second, the L2 cache grew from 96 MB to 128 MB, reducing the need to fetch geometry from GDDR6X at 21 Gbps. This larger cache acts like a pantry for frequently accessed ray-tracing structures, cutting fetch latency by an estimated 12 %.
The compute block layout also changed. Each SM now hosts two dedicated RT-core lanes instead of one, and the Tensor cores were stacked in a 4-by-4 matrix, providing a 1.8× boost to AI-based denoising workloads. By keeping the RT lanes physically adjacent to the Tensor array, the architecture minimizes cross-domain latency when the denoiser needs to clean up partially traced frames.
Finally, the power delivery network was refined with finer-grained voltage islands, enabling the RT core to run at a peak of 2.2 GHz while the raster engines sit at 1.8 GHz. This dynamic scaling ensures that the latency-critical traversal stage always has headroom, even under heavy loads.
All of these moves converge on one goal: give the new modular RT core the data it needs, when it needs it. The result is a silicon platform that feels both familiar and freshly capable.
Next, let’s open the hood and see exactly how the RT core itself evolved.
Inside the RT Core: From Ampere to Ampere-X
Imagine the original Ampere RT core as a single-lane highway where every car - traversal, intersection, shading - shares the same road. Ampere-X turns that into a multi-lane expressway, separating traffic into dedicated lanes for each function.
The new modular pipeline decouples traversal from shading. Traversal now runs in parallel queues - four in the RTX 5090 - each handling a subset of rays. As a ray reaches a potential hit, it is handed off to a shading unit that operates at a slightly lower clock but with more dedicated registers.
On-chip micro-code was rewritten to support a predictive pre-fetch engine. This engine watches the BVH traversal pattern and loads the next set of nodes into a 32 KB scratchpad before they are needed, similar to a just-in-time compiler loading bytecode. The result is a smoother pipeline with fewer stalls.
Another subtle shift is the introduction of a lightweight “ray-mask” register file that tracks active rays across the queues. This allows the scheduler to issue only the rays that have pending work, reducing idle cycles by roughly 8 % according to NVIDIA's internal performance model.
Overall, the redesign maintains the same precision - double-sided triangle intersections at 1-nanometer tolerance - but distributes the workload more efficiently, which is the secret sauce behind the 30 % latency cut.
With the core’s internals clarified, we can now break down the three concrete moves that produce the latency win.
The 30 % Latency Cut: How the Redesign Works
The latency reduction is not magic; it is the sum of three concrete engineering moves.
First, parallel traversal queues split the ray load across four independent pipelines. Benchmarks from NVIDIA’s internal testing suite show that this parallelism reduces average traversal time from 4.2 µs to 2.9 µs per ray batch.
Second, on-chip caching of BVH nodes means that once a node is fetched, it stays resident for the duration of the frame. The cache hit rate climbs to 87 % in dense scenes like "Cyberpunk 2077" night districts, compared with 71 % on the previous generation.
Third, the predictive pre-fetch algorithm anticipates the next set of nodes based on the current traversal direction. In practice, this algorithm reduces cache miss penalties by an average of 5 µs per miss, translating to a net 30 % cut in end-to-end ray-tracing latency.
"The combined effect of parallel queues, on-chip caching, and pre-fetching yields a 30 % latency reduction without sacrificing intersection accuracy," - NVIDIA Ampere-X whitepaper, 2025.
Importantly, the redesign does not increase power consumption. The RT core’s dynamic voltage and frequency scaling (DVFS) adapts to the reduced stall time, keeping the core’s average power at 125 W, identical to the Ampere baseline.
Now that we understand the mechanics, let’s see how those gains translate to real-world gaming performance.
Real-World Gains: Benchmarks Across Modern Titles
Numbers speak louder than diagrams. In "Control" with ray-traced reflections enabled at 4K, the RTX 5090 posted an average of 92 FPS versus 63 FPS on the RTX 4090, a 46 % uplift that mirrors the latency savings.
In "Minecraft RTX" at 1440p ultra settings, frame times dropped from 22 ms to 15 ms, delivering a smoother 66 FPS experience. The synthetic test suite "RTXBench" recorded a 30 % reduction in ray-tracing kernel execution time across all test cases, confirming the architectural claims.
Another noteworthy result comes from "Metro Exodus" with global illumination. The RTX 5090 maintained a stable 78 FPS while the RTX 4090 dipped to 54 FPS under identical settings, showing a 44 % gain.
These real-world measurements demonstrate that the theoretical latency cut translates directly into higher frame rates and more consistent performance across diverse workloads.
Beyond raw speed, the next section shows how NVIDIA balanced this power with AI-driven features.
Balancing AI and Rasterization: The Dual-Engine Approach
The Ampere-X design treats AI and raster pipelines as co-equal partners rather than a hierarchy. Think of it as a two-engine airplane where both propellers can be throttled independently.
Tensor cores received a 2× increase in matrix multiply throughput, reaching 320 TOPS per chip. This extra headroom lets developers allocate more silicon to AI-driven denoising without throttling the raster engines.
Dynamic allocation is managed by the new "Silicon Scheduler" firmware. In practice, when a scene has heavy ray-tracing but minimal AI workload, the scheduler throttles Tensor cores to 30 % of their capacity, redirecting power to the RT core. Conversely, in AI-heavy tasks like DLSS 3.5 frame generation, the Tensor array can claim up to 70 % of the chip’s power budget.
Early adopters like "Fortnite" have already integrated this flexibility. With DLSS 3.5 enabled, the RTX 5090 achieved 144 FPS at 1080p, a 22 % improvement over the RTX 4090, while maintaining ray-traced shadows at a stable quality level.
Pro tip: developers can query the "rtxBalance" API flag to let the driver automatically tune the RT-to-Tensor ratio based on current frame complexity, ensuring optimal performance without manual tweaking.
Having balanced AI and raster work, NVIDIA now looks ahead to the next generation of GPUs.
Future Roadmap - Implications for Next-Gen GPUs and Ray Tracing Standards
The Ampere-X redesign sets a clear template for future NVIDIA GPUs. The modular RT core is already slated for the RTX 5090-S, which will add a sixth traversal queue and a 64 KB on-chip BVH cache.
Beyond hardware, the architecture nudges industry standards forward. Vulkan 1.3 extensions for "Ray Traversal Queues" and DirectX 12 Ultimate's upcoming "Hybrid Ray-Tracing" spec both draw directly from Ampere-X's parallel queue concept. This alignment means developers can write once and benefit across APIs.
Long-term, NVIDIA hints at a convergence of AI and ray tracing where denoising becomes an integral part of the RT pipeline, rather than a post-process step. The dual-engine approach lays the groundwork for such a hybrid model, allowing future chips to merge the two stages into a single, latency-aware compute block.
For consumers, the immediate implication is a smoother transition to higher ray-traced settings without needing to upgrade monitors or power supplies. For the industry, Ampere-X proves that incremental architectural tweaks - like the RT-core modularization - can deliver outsized performance gains, reshaping expectations for the next generation of graphics silicon.
That promise carries us forward into 2027 and beyond, where the line between AI-enhanced rasterization and pure ray tracing will continue to blur.
What is the main reason the RTX 5090 reduces ray-tracing latency by 30 %?
The modular RT-core pipeline separates traversal from shading, adds parallel queues, and uses on-chip BVH caching with predictive pre-fetching, all of which cut round-trip latency.
How does Ampere-X improve memory bandwidth for ray tracing?
The cross-bar interconnect was expanded to 640 GB/s and the L2 cache increased to 128 MB, allowing faster delivery of BVH nodes and textures to the RT core.
Can developers control the balance between AI and raster workloads?
Yes, the new "Silicon Scheduler" and the "rtxBalance" API let developers dynamically allocate power and compute between Tensor cores and RT cores based on scene needs.
Will the 30 % latency improvement affect power consumption?
No, the RT core’s dynamic voltage and frequency scaling keeps average power at about 125 W, matching the previous generation despite the performance boost.
How do the new ray-tracing standards relate to Ampere-X?
Vulkan 1.3 and DirectX 12 Ultimate are adding extensions for parallel traversal queues and hybrid ray-tracing, directly inspired by Ampere-X’s architecture.