.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=========================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor. This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor. The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA). Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs. A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware. See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux. But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms. Consequently, vPCI devices on Hyper-V
have a dual identity. They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a bare-
metal system. Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare-metal. Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables. vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time. In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host. That
connection has a single VMBus channel. The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux. Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be interrupted
in the guest, or if the vPCI device is removed from
the VM while the VM is running. The ongoing operation of the
device happens directly between the Linux device driver for
the device and the hardware, with VMBus and the VMBus channel
playing no role.

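The following sketch shows roughly how the probe path opens
the VMBus channel; it is illustrative only, with placeholder
ring buffer sizes and a hypothetical callback name, and is not
the exact code in hv_pci_probe()::

  #include <linux/hyperv.h>

  /* Invoked when the host places a message in the channel's
   * ring buffer, e.g. a reply to a vPCI protocol request.
   */
  static void example_vpci_channel_cb(void *context)
  {
          /* Process vPCI protocol messages from the host. */
  }

  static int example_vpci_probe(struct hv_device *hdev,
                                const struct hv_vmbus_device_id *dev_id)
  {
          int ret;

          /* Open the single VMBus channel used to exchange setup
           * and teardown messages with the vPCI VSP on the host.
           */
          ret = vmbus_open(hdev->channel, 4 * PAGE_SIZE, 4 * PAGE_SIZE,
                           NULL, 0, example_vpci_channel_cb, hdev);
          if (ret)
                  return ret;

          /* ... negotiate the protocol version, query device
           * relations, allocate MMIO, enter D0, and create the
           * root PCI bus ...
           */
          return 0;
  }
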
PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows. Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge. The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device. The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions. The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.

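As a rough sketch of the idea (the collision-resolution helper
below is hypothetical; the real logic lives in hv_pci_probe()
and its helpers), the candidate domainID is formed from the
instance GUID like this::

  /* Form a 16-bit PCI domain candidate from bytes 4 and 5 of
   * the VMBus instance GUID; if it collides with a domain that
   * is already in use, fall back to another free value.
   */
  u16 dom_req = hdev->dev_instance.b[5] << 8 |
                hdev->dev_instance.b[4];
  u16 dom = example_resolve_domain_collision(dom_req);
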
hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device. This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0. See
hv_pci_enter_d0(). When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

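A hedged sketch of this step, assuming the vmbus_allocate_mmio()
helper and a hypothetical wrapper that sends the "enter D0"
message (the size and alignment values are illustrative)::

  struct resource *cfg_res;
  int ret;

  /* Reserve guest MMIO space to serve as the device's config
   * window; guest accesses to it trap to the Hyper-V host.
   */
  ret = vmbus_allocate_mmio(&cfg_res, hdev, 0, -1,
                            0x2000, 0x1000, false);
  if (ret)
          return ret;

  /* Report the window to the host as part of entering D0; see
   * hv_pci_enter_d0() for the real message exchange.
   */
  ret = example_send_enter_d0(hdev, cfg_res->start);
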
hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs. That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

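Conceptually, the allocated BAR space becomes a window of the
fabricated host bridge, as in this hedged sketch (the size
variable and error handling are placeholders)::

  struct resource *bar_res;
  LIST_HEAD(resources);

  /* Allocate guest MMIO to back the device BARs, sized from
   * the BAR information reported by the host, and expose it as
   * a host bridge window so generic PCI resource assignment
   * can place the BARs inside it.
   */
  ret = vmbus_allocate_mmio(&bar_res, hdev, 0, -1,
                            bar_space_needed, 0x1000, false);
  if (ret)
          return ret;

  pci_add_resource(&resources, bar_res);
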
Finally, hv_pci_probe() creates the root PCI bus. At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

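A minimal sketch of that final step using generic PCI core
entry points (the config access ops, sysdata pointer, and the
resource list carried over from the BAR setup above are
placeholders)::

  struct pci_bus *bus;

  /* Create the root bus of this device's private PCI domain,
   * then let the PCI core discover the device, assign its BARs
   * within the bridge windows, and bind the device driver.
   */
  bus = pci_create_root_bus(&hdev->device, 0, &example_pci_ops,
                            sysdata, &resources);
  if (!bus)
          return -ENOMEM;

  pci_lock_rescan_remove();
  pci_scan_child_bus(bus);
  pci_bus_assign_resources(bus);
  pci_bus_add_devices(bus);
  pci_unlock_rescan_remove();
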
PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM. The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device. Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device. When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed. At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device. Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel. The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest. If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference. Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

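In outline, the eject handling looks roughly like the sketch
below (hedged; the domain, slot decoding, and acknowledgement
wrapper are placeholders; see hv_eject_device_work() in
pci-hyperv.c for the real sequence)::

  struct pci_dev *pdev;

  /* Find the PCI identity of the ejected device within its
   * private domain and detach it from the PCI core.
   */
  pdev = pci_get_domain_bus_and_slot(domain, 0, devfn);
  if (pdev) {
          pci_lock_rescan_remove();
          pci_stop_and_remove_bus_device(pdev);
          pci_dev_put(pdev);
          pci_unlock_rescan_remove();
  }

  /* Acknowledge with "Ejection Complete" so the host can
   * proceed to rescind the VMBus channel.
   */
  example_send_eject_complete(hdev, wslot);
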
After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message. If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky. Ejection has been
observed even before a newly offered vPCI device has been
fully set up. The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times. Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces. For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU. Finally, hv_irq_unmask() is called (on x86) or the
GICD registers are set (on arm64) to specify the real vCPU
again. Each of these three calls interacts with Hyper-V, which
must decide which physical CPU should receive the interrupt
before it is forwarded to the guest VM. Unfortunately, the
Hyper-V decision-making process is a bit limited, and can
result in concentrating the physical interrupts on a single
CPU, causing a performance bottleneck. See details about how
this is resolved in the extensive comment above the function
hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message. Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt. Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message. As a
further complication, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well. See comments in the code regarding this
very tricky area.

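The shape of that polling can be sketched loosely as follows
(the request and completion structures, the timeout handling,
and the channel-draining helper are illustrative placeholders;
the real code is in hv_compose_msi_msg())::

  /* Send the "create interrupt" request, then busy-poll for
   * the reply, because sleeping is not allowed in this
   * context.
   */
  ret = vmbus_sendpacket(hdev->channel, &int_pkt, sizeof(int_pkt),
                         (unsigned long)&comp, VM_PKT_DATA_INBAND,
                         VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
  if (ret)
          return ret;

  while (!comp.done) {
          /* Bail out if the device was rescinded mid-poll. */
          if (hdev->channel->rescind)
                  return -ENODEV;

          /* Drain the channel from this context, since the
           * normal interrupt-driven callback may not be able
           * to run.
           */
          example_poll_channel(hdev->channel);
          udelay(100);
  }
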
Most of the code in the Hyper-V virtual PCI driver (pci-
hyperv.c) applies to Hyper-V and Linux guests running on x86
and on arm64 architectures. But there are differences in how
interrupt assignments are managed. On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt. This
hypercall is made by hv_arch_irq_unmask(). On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86. Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs. If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting and
everything works properly. However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break. Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen. Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory. Hence
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers. The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host. From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA. When running on x86, this behavior is
required by the architecture. When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT. But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a VMBus
device and as a PCI device). See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

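Conceptually, the propagation amounts to something like the
following sketch (the device pointers and the helper that
applies the coherency setting to the child are hypothetical;
see vmbus_dma_configure() for the real code)::

  /* Read the coherency attribute of the top-level VMBus ACPI
   * device and mark the child VMBus device, including vPCI
   * devices, accordingly.
   */
  bool coherent = device_get_dma_attr(vmbus_acpi_dev) ==
                  DEV_DMA_COHERENT;

  example_set_child_dma_coherency(child_device, coherent);
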
vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest. Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used. The version is negotiated when communication over
the VMBus channel is first established. See
hv_pci_protocol_negotiation(). Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.

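A rough sketch of the negotiation loop, assuming a list of
supported versions ordered newest-first and a hypothetical
helper that offers one version to the host (the numeric values
are illustrative, not the real protocol constants)::

  static const u32 example_versions[] = {
          0x00010004,     /* newest version this guest understands */
          0x00010003,
          0x00010002,
          0x00010001,     /* oldest */
  };
  int i;

  /* Offer each version in turn until the host accepts one; the
   * accepted version determines which messages can be used for
   * the rest of the device's life.
   */
  for (i = 0; i < ARRAY_SIZE(example_versions); i++) {
          if (example_query_protocol_version(hdev,
                                             example_versions[i]) == 0)
                  return 0;       /* negotiated successfully */
  }
  return -EPROTO;
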
Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver. See
hv_pci_assign_numa_node(). If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0. But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options. If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately it is not possible to distinguish the two cases
from the guest side.

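Sketched loosely (the device description structure and its
NUMA field are placeholders; the real logic is in
hv_pci_assign_numa_node()), the reported node is validated and
recorded on the struct device so that memory allocations and
the device driver can use it::

  /* Use the host-reported node when the negotiated protocol
   * carries NUMA information and the node is valid; otherwise
   * fall back to node 0.
   */
  int node = 0;

  if (protocol_has_numa_info && node_online(desc->virtual_numa_node))
          node = desc->virtual_numa_node;

  set_dev_node(&pdev->dev, node);
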
PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver. In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must invoke
hypercalls with explicit arguments describing the access to be
made.

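The distinction can be pictured roughly as follows (a hedged
sketch; the mapped config window, offsets, and the hypercall
wrapper are placeholders, not the exact helpers used by the
driver)::

  u32 val;

  if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
          /* Normal VM: read through the MMIO config window;
           * the host intercepts and emulates the access.
           */
          val = readl(cfg_window + where);
  } else {
          /* CoCo VM: the host cannot decode the trapping
           * instruction, so describe the access explicitly in
           * a hypercall.
           */
          val = example_mmio_read_hypercall(cfg_gpa + where,
                                            sizeof(val));
  }
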
Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest. The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel. As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud. The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
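
A consumer such as a NIC driver might use the read interface
roughly as in the sketch below; the buffer size and block ID
are illustrative values, not constants defined by the
interface::

  #include <linux/hyperv.h>
  #include <linux/pci.h>

  #define EXAMPLE_BLOCK_ID        1       /* illustrative */

  static int example_read_diag_block(struct pci_dev *pdev)
  {
          u8 buf[128];
          unsigned int bytes_returned;
          int ret;

          /* Read one diagnostic config block from the host via
           * the vPCI back-channel; block IDs are agreed between
           * the device vendor and the host.
           */
          ret = hyperv_read_cfg_blk(pdev, buf, sizeof(buf),
                                    EXAMPLE_BLOCK_ID, &bytes_returned);
          if (ret == 0)
                  dev_info(&pdev->dev, "read %u bytes from config block\n",
                           bytes_returned);
          return ret;
  }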