# NvmExpressDxe

## Overview

NvmExpressDxe is a UEFI driver that manages NVM Express (NVMe) non-volatile memory subsystems
connected over PCI. It follows the UEFI Driver Model and the NVM Express specification to
discover, initialize, and provide block-level access to NVMe storage devices during the UEFI
boot phase.

## Module Details

| Field | Value |
|---|---|
| **Module Type** | UEFI_DRIVER |
| **INF GUID** | `5BE3BDF4-53CF-46a3-A6A9-73C34A6E5EE3` |
| **Entry Point** | `NvmExpressDriverEntry` |
| **Unload** | `NvmExpressUnload` |
| **Architectures** | IA32, X64, EBC |

## What This Module Does

### Driver Binding

The driver implements the standard UEFI Driver Binding Protocol (`Supported`, `Start`, `Stop`)
to attach to PCI devices with class code Mass Storage / NVM (0x01/0x08) and NVMHCI programming
interface (0x02).
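The class-code match that `Supported` performs can be sketched in plain C. The struct and function names below are illustrative, not the actual EDK2 definitions; only the three class-code byte values come from the PCI ID assignments referenced above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of the PCI class-code bytes (config space offsets
 * 0x09-0x0B); not the actual EDK2 PCI structure definition. */
typedef struct {
  uint8_t ProgInterface; /* 0x02 = NVMHCI (NVMe) */
  uint8_t SubClassCode;  /* 0x08 = Non-Volatile Memory controller */
  uint8_t BaseCode;      /* 0x01 = Mass Storage controller */
} PCI_CLASS_CODE;

/* Returns nonzero when the device is an NVMe controller. */
static int
IsNvmeController (const PCI_CLASS_CODE *ClassCode)
{
  return ClassCode->BaseCode == 0x01 &&
         ClassCode->SubClassCode == 0x08 &&
         ClassCode->ProgInterface == 0x02;
}
```

If any of the three bytes differs (for example an AHCI controller, subclass 0x06), `Supported` returns unsupported and the driver never binds.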
### Controller Initialization

During `Start`, the driver:

1. Opens the `EFI_PCI_IO_PROTOCOL` on the controller handle.
2. Reads the NVMe Controller Capabilities register (`CAP`) and validates NVM command set support.
3. Allocates DMA-accessible buffers for admin submission/completion queues.
4. Disables the controller, programs the Admin Queue Attributes (`AQA`), Admin Submission Queue
   Base Address (`ASQ`), and Admin Completion Queue Base Address (`ACQ`), then re-enables the
   controller.
5. Sends Identify Controller to retrieve controller metadata (serial number, model, capabilities).
6. Uses the Set Features command (Number of Queues) to negotiate I/O queue pairs with the
   controller.
7. Allocates DMA-accessible buffers for I/O submission/completion queues and creates the I/O
   queue pairs via Create I/O Completion Queue and Create I/O Submission Queue admin commands.
8. Enumerates NVMe namespaces and creates child handles for each discovered namespace.
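Step 4's `AQA` programming packs both admin queue sizes into one 32-bit register. A minimal sketch of the encoding (field layout per the NVMe specification; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* AQA.ASQS (bits 11:0) and AQA.ACQS (bits 27:16) hold the admin
 * submission and completion queue sizes as zero-based entry counts. */
static uint32_t
BuildAqa (uint32_t AsqEntries, uint32_t AcqEntries)
{
  return (((AsqEntries - 1) & 0xFFFu)) |
         (((AcqEntries - 1) & 0xFFFu) << 16);
}
```

For example, a pair of 2-entry admin queues encodes as `0x00010001`, and 255-entry queues as `0x00FE00FE`.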
### Protocols Produced (per controller)

| Protocol | Purpose |
|---|---|
| `EFI_NVM_EXPRESS_PASS_THRU_PROTOCOL` | Raw NVMe command passthrough for admin and I/O commands. Installed on the controller handle. |
| `EFI_DRIVER_SUPPORTED_EFI_VERSION_PROTOCOL` | Declares the EFI specification version the driver supports. Installed on the driver image handle at entry point. |

### Protocols Produced (per namespace)

| Protocol | Purpose |
|---|---|
| `EFI_BLOCK_IO_PROTOCOL` | Synchronous block read/write/flush/reset operations. |
| `EFI_BLOCK_IO2_PROTOCOL` | Asynchronous (non-blocking) block I/O operations. Only installed when the controller allocates more than one I/O queue pair. |
| `EFI_DISK_INFO_PROTOCOL` | Exposes NVMe Identify Namespace data for disk information queries. |
| `EFI_STORAGE_SECURITY_COMMAND_PROTOCOL` | Security Send/Receive commands (if the controller supports OACS bit 0). |
| `MEDIA_SANITIZE_PROTOCOL` | Media Clear, Purge, and Format operations mapped to NVMe Format NVM and Sanitize admin commands per NIST SP 800-88 guidelines. |

### Protocols Consumed

| Protocol | Purpose |
|---|---|
| `EFI_PCI_IO_PROTOCOL` | PCI BAR memory access, DMA buffer allocation, and bus master mapping. |
| `EFI_DEVICE_PATH_PROTOCOL` | Device path construction for namespace child handles. |
| `EFI_RESET_NOTIFICATION_PROTOCOL` | Registers a shutdown callback to gracefully shut down all NVMe controllers before platform reset. |

### Asynchronous I/O

The driver uses a periodic timer event (`NVME_HC_ASYNC_TIMER`, 1 ms) to poll the async I/O
completion queue and process completed asynchronous requests. The `BlockIo2` protocol is only
installed when the controller has allocated at least two I/O queue pairs (one for blocking, one
for async).
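The polling step relies on the NVMe phase-tag convention: each completion queue entry carries a phase bit that the controller inverts on every queue wrap, so the host can tell new entries from stale ones without a doorbell from the device. A simplified sketch (types and names here are illustrative, not the driver's actual structures):

```c
#include <assert.h>
#include <stdint.h>

/* An NVMe completion queue entry is 16 bytes; bit 16 of its last dword
 * is the phase tag, flipped by the controller on every queue wrap. */
typedef struct {
  uint32_t Dw0;
  uint32_t Dw1;
  uint32_t Dw2;
  uint32_t Dw3; /* bit 16 = phase tag, bits 31:17 = status field */
} NVME_CQE;

typedef struct {
  NVME_CQE  *Entries;
  uint32_t  Size;          /* number of entries in the queue */
  uint32_t  Head;
  uint32_t  ExpectedPhase; /* 0 or 1 */
} CQ_STATE;

/* Returns 1 and consumes the entry at Head if a new completion arrived. */
static int
PollCompletion (CQ_STATE *Cq)
{
  if (((Cq->Entries[Cq->Head].Dw3 >> 16) & 1u) != Cq->ExpectedPhase) {
    return 0; /* entry still belongs to the previous pass: nothing new */
  }
  if (++Cq->Head == Cq->Size) {
    Cq->Head          = 0;
    Cq->ExpectedPhase ^= 1u; /* controller flips phase after a wrap */
  }
  return 1;
}
```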
### Controller Reset

On command timeout, the driver performs a full controller reset: disable, re-program admin queues,
re-enable, re-identify, re-negotiate queue count, and re-create I/O queues—all while preserving
allocated DMA buffers.

### Shutdown Notification

The driver registers with `EFI_RESET_NOTIFICATION_PROTOCOL` to issue NVMe shutdown notifications
(CC.SHN) to all managed controllers before a platform reset, ensuring data integrity.
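The register fields involved in a shutdown notification can be sketched as follows (bit positions per the NVMe specification; the helper names are hypothetical). The host writes `CC.SHN`, then polls `CSTS.SHST` until the controller reports shutdown complete:

```c
#include <assert.h>
#include <stdint.h>

#define NVME_CC_SHN_SHIFT        14   /* CC.SHN field, bits 15:14 */
#define NVME_CC_SHN_NORMAL       0x1u /* 01b = normal shutdown notification */
#define NVME_CSTS_SHST_SHIFT     2    /* CSTS.SHST field, bits 3:2 */
#define NVME_CSTS_SHST_COMPLETE  0x2u /* 10b = shutdown processing complete */

/* Set CC.SHN to request a normal shutdown, preserving other CC fields. */
static uint32_t
RequestNormalShutdown (uint32_t Cc)
{
  Cc &= ~(0x3u << NVME_CC_SHN_SHIFT);
  return Cc | (NVME_CC_SHN_NORMAL << NVME_CC_SHN_SHIFT);
}

/* Poll this against CSTS until the controller reports completion. */
static int
IsShutdownComplete (uint32_t Csts)
{
  return ((Csts >> NVME_CSTS_SHST_SHIFT) & 0x3u) == NVME_CSTS_SHST_COMPLETE;
}
```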
## Source Files

| File | Description |
|---|---|
| `NvmExpress.c` | Driver entry point, driver binding, namespace enumeration, queue cleanup. |
| `NvmExpress.h` | Main header: data structures, constants, macros, function declarations. |
| `NvmExpressHci.c` | HCI register access, controller init/reset, admin and I/O queue creation. |
| `NvmExpressHci.h` | HCI function declarations. |
| `NvmExpressPassthru.c` | NVM Express PassThru protocol implementation (blocking and async). |
| `NvmExpressBlockIo.c` | BlockIo and BlockIo2 protocol implementations. |
| `NvmExpressBlockIo.h` | BlockIo/BlockIo2 function declarations. |
| `NvmExpressDiskInfo.c` | DiskInfo protocol implementation. |
| `NvmExpressDiskInfo.h` | DiskInfo function declarations. |
| `NvmExpressMediaSanitize.c` | Media Sanitize protocol: Clear, Purge, Format via NVMe commands. |
| `NvmExpressMediaSanitize.h` | Media Sanitize function declarations and types. |
| `ComponentName.c` | Component Name and Component Name2 protocol implementations. |
| `UnitTest/MediaSanitizeUnitTest.c` | Host-based unit tests for the Media Sanitize functionality. |

---

## MU_CHANGE Summary

This section documents all Microsoft (Project Mu) changes made to the upstream EDK2 NvmExpressDxe
driver. Each change is tagged in the source with `// MU_CHANGE` comments.

### 1. Allocate IO Queue Buffer

**Tag:** `MU_CHANGE - Allocate IO Queue Buffer`

**Files:** `NvmExpress.h`, `NvmExpress.c`, `NvmExpressHci.h`, `NvmExpressHci.c`, `NvmExpressBlockIo.c`, `NvmExpressPassthru.c`

**What changed:**

The upstream driver allocates a single flat 6-page DMA buffer at `DriverBindingStart` time and
carves fixed 4 KiB regions out of it for all six queues (admin SQ, admin CQ, I/O SQ #1, I/O CQ #1,
I/O SQ #2, I/O CQ #2). This MU change replaces that approach with a split allocation model:

- **Admin queues** are allocated separately based on the actual admin queue entry size and count
  derived from controller capabilities (via `NVME_SQ_SIZE_IN_PAGES` / `NVME_CQ_SIZE_IN_PAGES`
  macros).
- **I/O queues** are allocated in a separate DMA buffer (`IoQueueBuffer` / `IoQueueBufferPciAddr`)
  whose size is computed dynamically based on the negotiated number of I/O queue pairs and their
  entry sizes.
- A new structure `NVME_QUEUE_SIZE_DATA` (with `NumberOfEntries` and `EntrySize` fields) is added
  to track per-queue sizing metadata in the controller private data.
- Queue buffer page counts are computed using macros (`NVME_SQ_SIZE_IN_PAGES`,
  `NVME_CQ_SIZE_IN_PAGES`) that use the actual entry size (as a power of 2) rather than assuming
  fixed 4 KiB pages.
- New functions are introduced:
  - `NvmeControllerInitAdminQueues()` — initializes admin queue buffer pointers and programs ASQ/ACQ.
  - `NvmeControllerInitIoQueues()` — initializes I/O queue buffer pointers and creates I/O CQ/SQ.
  - `NvmExpressDriverCleanUpQueues()` — unmaps and frees both admin and I/O queue DMA buffers.
  - `NvmeControllerReset()` — performs a full controller reset reusing existing buffer allocations.
  - `ReadNvmeAdminQueueAttributes()` — reads the AQA register for validation during reset.
- The `NvmeEnableController()` function now accepts `IoSqEs` and `IoCqEs` parameters to program
  the correct queue entry sizes into `CC.IOSQES` and `CC.IOCQES`.
- Page mask operations on queue base addresses are removed since buffers allocated via
  `AllocatePages` are already page-aligned.
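The sizing arithmetic such macros perform can be sketched as follows (this is a reconstruction of the idea, not the exact MU definition). NVMe submission queue entries are 64 bytes (2^6) and completion queue entries 16 bytes (2^4), so a queue's footprint is its entry count shifted by the entry-size exponent, rounded up to whole pages:

```c
#include <assert.h>
#include <stdint.h>

#define EFI_PAGE_SIZE_BYTES  4096u

/* Pages needed for a queue of Entries entries whose entry size is
 * 2^EntrySizeLog2 bytes, rounded up to whole pages. */
#define QUEUE_SIZE_IN_PAGES(Entries, EntrySizeLog2)                       \
  ((((uint32_t)(Entries) << (EntrySizeLog2)) + EFI_PAGE_SIZE_BYTES - 1)   \
   / EFI_PAGE_SIZE_BYTES)
```

A 2-entry SQ (128 bytes) still consumes one page, while a 255-entry SQ (16320 bytes) needs four pages and a 255-entry CQ (4080 bytes) fits in one, which is why the fixed one-page-per-queue carving breaks down for the alternative queue size.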
**Why it's needed:**

The fixed 6-page allocation is insufficient for queue depths greater than one entry (for
example, the alternative queue size of 255 entries). The dynamic allocation supports variable
queue entry counts and entry sizes, allows the admin and I/O queues to be managed independently,
and enables proper cleanup and reset without leaking DMA memory. Separating the I/O queue buffer
also means the driver can scale the allocation based on how many queue pairs the controller
actually grants.

---

### 2. Request Number of Queues from Controller

**Tag:** `MU_CHANGE - Request Number of Queues from Controller`

**Files:** `NvmExpress.h`, `NvmExpress.c`, `NvmExpressHci.c`

**What changed:**

- A new function `NvmeSetFeaturesNumberOfQueues()` is added. It sends the NVMe Set Features
  command (Feature ID: Number of Queues) to request the desired number of I/O queue pairs from
  the controller. The controller may allocate fewer pairs than requested; the driver stores the
  actual granted count in `Private->NumberOfIoQueuePairs`.
- The maximum number of queue pairs the driver requests is defined by `NVME_MAX_QUEUES` (3 total:
  1 admin + 2 I/O), meaning the driver requests up to 2 I/O queue pairs.
- The `NVME_SUPPORT_BLOCKIO2()` macro checks whether the controller allocated more than 1 I/O
  queue pair. If not, the `BlockIo2` protocol and async timer event are **not** installed.
- `NvmeCreateIoCompletionQueue()` and `NvmeCreateIoSubmissionQueue()` loop from index 1 to
  `NumberOfIoQueuePairs` instead of using hardcoded indices.
- `BlockIo2` protocol installation/uninstallation in `EnumerateNvmeDevNamespace()` and
  `UnregisterNvmeNamespace()` is made conditional on `NVME_SUPPORT_BLOCKIO2()`.
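The negotiation can be sketched in terms of the command dwords involved (Feature Identifier 0x07 and the CDW11/completion-dword layouts come from the NVMe specification; the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_FEATURE_NUMBER_OF_QUEUES  0x07u /* Feature ID, goes in CDW10 */

/* CDW11 carries NSQR in bits 15:0 and NCQR in bits 31:16, zero-based. */
static uint32_t
BuildNumberOfQueuesCdw11 (uint16_t SqRequested, uint16_t CqRequested)
{
  return (uint32_t)(SqRequested - 1) | ((uint32_t)(CqRequested - 1) << 16);
}

/* Completion dword 0 returns NSQA/NCQA (zero-based granted counts);
 * the usable number of I/O queue pairs is the smaller of the two. */
static uint16_t
GrantedIoQueuePairs (uint32_t CompletionDw0)
{
  uint16_t Nsqa = (uint16_t)(CompletionDw0 & 0xFFFFu);
  uint16_t Ncqa = (uint16_t)(CompletionDw0 >> 16);
  return (uint16_t)((Nsqa < Ncqa ? Nsqa : Ncqa) + 1);
}
```

Requesting 2 SQs and 2 CQs encodes CDW11 as `0x00010001`; a controller that returns `0` in completion dword 0 has granted only a single I/O queue pair, which is the case where `BlockIo2` is skipped.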
**Why it's needed:**

The upstream driver assumes every controller supports exactly two I/O queue pairs. Some NVMe
controllers (especially embedded or resource-constrained ones) may only support a single I/O
queue pair. By querying the controller via Set Features and gracefully degrading (skipping
`BlockIo2` when only one queue pair is available), the driver avoids failures on controllers that
cannot satisfy a two-queue-pair request and correctly reflects the controller's actual capabilities.

---

### 3. Support Alternative Hardware Queue Sizes in NVME Driver

**Tag:** `MU_CHANGE - Support alternative hardware queue sizes in NVME driver`

**Files:** `NvmExpress.h`, `NvmExpress.c`, `NvmExpressDxe.inf`, `NvmExpressHci.c`, `NvmExpressPassthru.c`

**What changed:**

- A PCD `PcdSupportAlternativeQueueSize` (Boolean) is consumed. When `TRUE`, the driver uses a
  maximum queue size of 255 entries (`NVME_ALTERNATIVE_MAX_QUEUE_SIZE`) instead of the default
  sizes (1 for sync, 63/255 for async).
- Queue creation (`NvmeCreateIoCompletionQueue`, `NvmeCreateIoSubmissionQueue`) uses
  `MIN(NVME_ALTERNATIVE_MAX_QUEUE_SIZE, Cap.Mqes)` for all queues when the PCD is enabled.
- Admin queue sizes (`AQA.ASQS`, `AQA.ACQS`) also use the alternative size when the PCD is set.
- Passthrough command handling (`NvmExpressPassThru`) adjusts queue head/tail pointer arithmetic
  to use modular wrap-around (instead of XOR toggle) when the alternative queue size is active,
  supporting queue depths greater than 1.
- The async task list processor (`ProcessAsyncTaskList`) similarly uses the alternative queue size
  for completion queue head management.
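The difference between the two pointer-advance schemes can be sketched as follows (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Upstream behavior: with a queue depth of 2, the index simply toggles
 * between slot 0 and slot 1. */
static uint16_t
AdvanceXorToggle (uint16_t Index)
{
  return Index ^ 1u; /* only correct for exactly two entries */
}

/* Alternative-queue-size behavior: wrap modulo the real queue depth. */
static uint16_t
AdvanceModular (uint16_t Index, uint16_t QueueSize)
{
  return (uint16_t)((Index + 1u) % QueueSize);
}
```

With a 255-entry queue, the XOR form would bounce between slots 0 and 1 forever, while the modular form walks all 255 slots and wraps back to 0.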
**Why it's needed:**

Some NVMe hardware implementations require a minimum queue depth greater than 1 (e.g., 255
entries). The upstream driver defaults to queue sizes of 1 for synchronous I/O and uses XOR-based
head/tail toggling, which only works with 2-entry queues (indices 0 and 1). When hardware requires
deeper queues, this feature enables proper modular arithmetic for queue management and allocates
appropriately sized buffers. This is controlled by a PCD so platforms that don't need it retain the
original behavior.

---

### 4. NVMe Namespace Filtering

**Tag:** `MU_CHANGE - NVMe namespace filtering`

**Files:** `NvmExpress.h`, `NvmExpress.c`, `NvmExpressDxe.inf`

**What changed:**

- A PCD `PcdNvmeNamespaceFilter` (Boolean) is consumed. When `TRUE`, namespace discovery is
  limited to only the first namespace (NSID 1).
- A constant `NVME_FIRST_NSID` (0x00000001) is defined.
- `DiscoverAllNamespaces()` takes a new `FilteringEnabled` parameter. When set, the loop breaks
  after enumerating the first namespace instead of iterating through all namespaces.
- In the `RemainingDevicePath` path, if filtering is enabled and the requested `NamespaceId` is
  not `NVME_FIRST_NSID`, the namespace is skipped.

**Why it's needed:**

In some platform or test configurations, it is desirable to restrict which NVMe namespaces are
exposed to the UEFI environment. For example, a system with a multi-namespace NVMe device may
only want the boot namespace (typically NSID 1) available during boot to reduce enumeration time,
limit attack surface, or avoid exposing non-boot partitions. This PCD-controlled filter provides
that capability without requiring source changes to the driver.

---

### 5. Use the Mqes Value from the Cap Register

**Tag:** `MU_CHANGE - Use the Mqes value from the Cap register`

**Files:** `NvmExpressHci.c`

**What changed:**

- When creating I/O completion and submission queues, the queue size is clamped to
  `MIN(requested_size, Cap.Mqes)` instead of using the requested size directly.
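A minimal sketch of the clamp (`CAP.MQES` is a zero-based maximum, so both values here are zero-based queue sizes; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a requested (zero-based) queue size to the controller's
 * CAP.MQES limit, mirroring MIN(requested_size, Cap.Mqes). */
static uint16_t
ClampQueueSize (uint16_t Requested, uint16_t Mqes)
{
  return Requested < Mqes ? Requested : Mqes;
}
```

For instance, requesting 255 entries (zero-based 254) from a controller whose `MQES` is 63 yields a 64-entry queue rather than an invalid command.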
**Why it's needed:**

The NVMe `CAP.MQES` field reports the maximum queue entries supported by the controller. If the
driver requests a queue larger than what the controller supports, the behavior is undefined or the
command fails. By clamping queue sizes to `MQES`, the driver respects the controller's hardware
limits and avoids creating oversized queues.

---

### 6. Correct Cap Parameter Modifier

**Tag:** `MU_CHANGE - Correct Cap parameter modifier`

**Files:** `NvmExpressHci.c`

**What changed:**

- The `ReadNvmeControllerCapabilities()` function signature is corrected so that the `Cap`
  parameter uses the `OUT` modifier instead of `IN`, reflecting that the function writes to (not
  reads from) this parameter.

**Why it's needed:**

A correctness fix. The `Cap` parameter is an output of `ReadNvmeControllerCapabilities()`—the
function reads the hardware register and writes the result into the caller's buffer. Marking it
`IN` was semantically incorrect and could mislead static analysis tools or code reviewers.

---

### 7. Improve NVMe Controller Init Robustness

**Tag:** `MU_CHANGE - Improve NVMe controller init robustness`

**Files:** `NvmExpressHci.c`

**What changed:**

- At the start of `NvmeControllerInit()` (and `NvmeControllerReset()`), the driver reads the PCI
  Vendor ID and Device ID. If either returns `0xFFFF` (`NVME_INVALID_VID_DID`), the function
  returns `EFI_DEVICE_ERROR` immediately.
- The assertion on `Cap.Mpsmin` is replaced with a conditional check and `EFI_DEVICE_ERROR` return,
  so the driver fails gracefully instead of asserting if the controller reports an unsupported
  minimum page size.

**Why it's needed:**

If an NVMe controller has been surprise-removed (hot-unplug), is behind a failed PCIe link, or
is otherwise inaccessible, PCI config reads return all-ones (`0xFFFF`). Without this check, the
driver would proceed to access invalid MMIO space, potentially causing system hangs or crashes.
The `Mpsmin` check change prevents a hard assert in production builds if the controller reports an
unexpected minimum memory page size, instead returning a clean error.

---

### 8. Remove Page Mask

**Tag:** `MU_CHANGE - Remove Page Mask` / `MU_CHANGE - Remove the page mask since the buffer is allocated using AllocatePages`

**Files:** `NvmExpressHci.c`

**What changed:**

- The page-alignment mask operations (`& ~(EFI_PAGE_SIZE - 1)`) on the admin submission queue
  (ASQ) and admin completion queue (ACQ) base addresses are removed.

**Why it's needed:**

Since the admin queue buffers are allocated using `PciIo->AllocateBuffer()` with
`AllocateAnyPages`, the returned addresses are already guaranteed to be page-aligned. Applying a
page mask is redundant and was removed for clarity. This is part of the broader Allocate IO Queue
Buffer refactoring.