NVIDIA’s High-End GeForce RTX 5090 and RTX PRO 6000 GPUs Allegedly Impacted by Virtualization Bug

Bug Hits NVIDIA’s Flagship GPUs in Virtualized Setups

NVIDIA's flagship GPUs, the GeForce RTX 5090 and RTX PRO 6000, are reportedly hit by a serious bug when used in virtual machines. The problem makes the cards unresponsive after days of use in VM environments, and recovery requires rebooting the entire host node, which can disrupt many client workloads at once.

What Hardware Is Affected

So far, the issue appears limited to Blackwell-based RTX 5090 and RTX PRO 6000 models. Other parts, including the RTX 4090, the Hopper H100, and the data-center Blackwell B200, do not show the same trouble. The fault shows up when a GPU is passed through to a VM via the VFIO driver framework. After a Function Level Reset (FLR), the card stops responding, triggering a kernel soft lockup that freezes both the host and its guests.
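To make the setup concrete, here is a minimal Python sketch of the scenario the reports describe: handing a GPU to the vfio-pci driver for passthrough and then issuing a Function Level Reset through sysfs. The PCI address is a hypothetical placeholder, and the script assumes a Linux host with root privileges and the vfio-pci module already loaded. It illustrates the reset path involved; it is not a reproduction recipe from NVIDIA or CloudRift.

```python
#!/usr/bin/env python3
"""Illustrative sketch: bind a GPU to vfio-pci for VM passthrough, then
trigger a Function Level Reset (FLR) via sysfs.

Assumptions (not from the original report): a Linux host, root privileges,
the vfio-pci module already loaded, and a placeholder PCI address."""

from pathlib import Path

PCI_ADDR = "0000:01:00.0"  # placeholder; find the real address with `lspci -nn`
PCI_DEVICES = Path("/sys/bus/pci/devices")


def bind_to_vfio(pci_addr: str) -> None:
    """Detach the device from its current driver and hand it to vfio-pci."""
    dev = PCI_DEVICES / pci_addr

    # Tell the PCI core which driver should claim this device on the next probe.
    (dev / "driver_override").write_text("vfio-pci")

    # Unbind from whatever driver currently owns the card (e.g. nvidia).
    if (dev / "driver").exists():
        (dev / "driver" / "unbind").write_text(pci_addr)

    # Re-probe the device so vfio-pci picks it up.
    Path("/sys/bus/pci/drivers_probe").write_text(pci_addr)


def function_level_reset(pci_addr: str) -> None:
    """Ask the kernel to reset the PCI function. This is the step after which
    the affected Blackwell cards reportedly stop responding."""
    (PCI_DEVICES / pci_addr / "reset").write_text("1")


if __name__ == "__main__":
    bind_to_vfio(PCI_ADDR)
    function_level_reset(PCI_ADDR)
```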

How the Problem Manifests in Practice

The bug appears once the GPU is assigned to a VM and then reset. In testing, the card suddenly stops responding, and the machine can stall until the host is rebooted. The lockup affects both the host and its guests, making services on the VM unreachable and interrupting in-flight AI training or data-processing jobs. That wastes time and can cause missed deadlines in busy data centers.
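Until a fix ships, operators may want an early warning when a passed-through card goes dark. The sketch below is one hedged approach, assuming a Linux host: it polls the GPU's PCI config space and flags the all-0xFF reads that a non-responding PCIe function typically returns. The address, polling interval, and alerting are placeholders; depending on how badly the soft lockup stalls the host, a userspace watchdog like this may or may not get the chance to report before the node needs a reboot.

```python
#!/usr/bin/env python3
"""Hedged sketch of a host-side watchdog for a passed-through GPU.

It polls the card's PCI config space; a PCIe function that has stopped
responding typically reads back as all 0xFF bytes. The address, interval,
and "alert" (a print) are placeholders, not part of the original report."""

import time
from pathlib import Path

PCI_ADDR = "0000:01:00.0"  # placeholder GPU address
POLL_SECONDS = 30          # how often to probe the device
CONFIG = Path("/sys/bus/pci/devices") / PCI_ADDR / "config"


def gpu_responding(config_path: Path) -> bool:
    """Return False if the vendor/device ID reads back as 0xFFFFFFFF,
    the usual signature of a function that no longer answers."""
    with config_path.open("rb") as f:
        ident = f.read(4)  # first 4 bytes: vendor ID + device ID
    return ident != b"\xff\xff\xff\xff"


def main() -> None:
    while True:
        if not gpu_responding(CONFIG):
            # In production this would page an operator or drain the node;
            # a full host reboot is currently the only reported recovery.
            print(f"[watchdog] GPU {PCI_ADDR} is not responding")
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```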

Reports From the Community

CloudRift, a GPU cloud service for developers, first flagged the issue. Its tests show that after a few days of VM assignments, affected GPUs stop responding. A Proxmox user also described a full host crash after shutting down a Windows guest VM with an RTX 5090 attached. These reports suggest the bug is not an isolated incident and can hit real-world setups hard. NVIDIA has since acknowledged the reports.

What NVIDIA Is Doing

NVIDIA confirmed the bug and began work on a fix after internal testing. The company says it has been able to reproduce the problem and is examining driver or firmware changes that could solve it. No timeline has been shared yet, but industry watchers expect a patch or new driver release soon. The rapid response reflects how critical these cards have become for AI and virtualization workloads.

Bug Bounty, Fixes, and What It Means for Teams

CloudRift has posted a $1,000 bug bounty for a fix or usable workaround. The offer underscores how important it is to have stable virtualization on these GPUs. Given the role of AI tasks and VM workloads in many shops, a fast patch would be welcomed by admins and cloud providers alike. In the meantime, teams may need to plan around this risk and test any fixes in a lab before moving to production.

Why This Matters for Data Centers and AI Labs

This bug slows down AI projects that rely on VM hosts. The ability to pass a GPU to a VM is crucial in mixed environments, where multiple teams share hardware. If the fix lands quickly, it could restore confidence in running Blackwell-based cards in virtualized pools. Until then, operators should keep a close watch on updates and weigh the cost of potential downtime against the need for heavy AI compute.

A Practical Look at the Road Ahead

For now, expect patches to roll out in the near term. Vendors will likely push driver updates and perhaps firmware tweaks to address the FLR-induced lockup. Enterprises should plan for testing windows and, if their VM workloads are time-sensitive, consider interim alternatives such as scheduling those jobs on GPUs that do not show the issue. The goal is clear: restore stability without slowing down AI research or production runs.
