NVMe over Fabrics (NVMe-oF) Networking

Enterprise data centers face a fundamental challenge: local NVMe storage delivers microsecond latency, but traditional network storage protocols add hundreds of microseconds of overhead when that storage needs to be shared. NVMe over Fabrics (NVMe-oF) solves this problem by extending the NVMe protocol across network fabrics, enabling remote storage access that approaches local NVMe performance. Storage administrators, data center architects, and infrastructure engineers responsible for high-performance storage systems need to understand how NVMe-oF works and when to deploy it.

What is NVMe over Fabrics (NVMe-oF)?

NVMe over Fabrics is a protocol specification that extends the Non-Volatile Memory Express (NVMe) storage protocol beyond local PCIe connections to work over network fabrics. Instead of translating NVMe commands to SCSI like traditional storage area networks (SANs), NVMe-oF carries the native NVMe protocol end-to-end across the network, preserving its advantages over legacy command sets.

The NVM Express Organization standardizes NVMe-oF with multiple transport bindings that allow the same NVMe command set to operate over different network technologies. This design maintains NVMe’s core characteristics—parallel queue architecture, low command overhead, and direct memory access—even when storage is physically remote.

Unlike traditional storage protocols like iSCSI that evolved from rotating disk assumptions, NVMe-oF was designed specifically for solid-state storage characteristics. The protocol supports up to 64,000 I/O queues compared to iSCSI’s single queue, enabling massive parallelism that matches modern CPU core counts and workload concurrency.

The Problem NVMe-oF Solves

Local NVMe SSDs connected via PCIe deliver latencies under 10 microseconds, but this performance remains isolated to a single server. When applications need shared storage—database clusters, virtualization environments, or container orchestration platforms—traditional approaches force a choice between performance and accessibility.

Converting NVMe to legacy protocols introduces significant latency penalties. iSCSI adds 200-500 microseconds by encapsulating SCSI commands in TCP/IP, going through the full network stack, and translating between command sets. Fibre Channel improves this to 50-100 microseconds but requires expensive dedicated infrastructure. Both protocols were designed for mechanical disk latencies measured in milliseconds, making their overhead acceptable for that generation of storage.

Modern workloads have different requirements. AI model training needs rapid access to terabyte-scale datasets across GPU clusters. Financial trading systems demand microsecond-level storage latency for real-time analytics. Database systems running on flash storage expose network and protocol overhead as the primary bottleneck rather than the storage media itself.

Data center consolidation strategies further compound this problem. Organizations want to pool NVMe storage resources efficiently, provision storage dynamically across workloads, and maintain high availability without sacrificing the performance benefits that justified investing in NVMe technology.

How NVMe-oF Works: Architecture Overview

NVMe-oF uses a client-server model with terminology borrowed from traditional storage networking. The host or initiator is the client system that needs to access remote storage, running NVMe-oF driver software. The target or subsystem is the storage server that exposes NVMe devices or namespaces to the network.

The protocol maintains a clear separation between the NVMe command set and the transport layer. NVMe commands—read, write, flush, and administrative operations—remain identical whether sent over PCIe, RDMA, TCP, or Fibre Channel. This transport abstraction allows NVMe-oF to support multiple network technologies with the same software stack.

Namespaces are the fundamental storage units in NVMe-oF. A namespace represents a quantity of non-volatile memory that can be formatted into logical blocks, similar to a partition or LUN in traditional storage. A single physical NVMe device can expose multiple namespaces, and targets can aggregate namespaces from different devices into a single subsystem.

Discovery services enable dynamic target identification. Hosts query a discovery controller to learn which subsystems are available, their transport addresses, and access requirements. This automation simplifies large-scale deployments where manually configuring hundreds of storage connections becomes impractical.

Multi-path support provides redundancy and load balancing. A host can establish multiple connections to the same subsystem through different network paths, automatically failing over if a path becomes unavailable and distributing I/O across paths for higher aggregate bandwidth.
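With Linux native NVMe multipath, a dual-path setup can be sketched as follows. The subsystem name, addresses, and the sysfs index nvme-subsys0 are illustrative and will differ per system:

```shell
# Confirm native NVMe multipath is enabled in the kernel (Y = enabled)
cat /sys/module/nvme_core/parameters/multipath

# Connect to the same subsystem through two different network paths
sudo nvme connect -t tcp -n nvme-subsys1 -a 192.168.1.100 -s 4420
sudo nvme connect -t tcp -n nvme-subsys1 -a 192.168.2.100 -s 4420

# Both controllers appear under one subsystem with a single namespace device
sudo nvme list-subsys

# Distribute I/O across paths instead of always using the first path
echo round-robin | sudo tee /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```

If one path fails, I/O continues over the surviving controller without the application noticing.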

NVMe-oF Transport Options Explained

NVMe-oF defines multiple transport bindings, each optimized for different infrastructure scenarios. Understanding the tradeoffs guides appropriate technology selection.

NVMe over RDMA (NVMe/RoCE or NVMe/InfiniBand) delivers the lowest latency, typically under 10 microseconds. Remote Direct Memory Access (RDMA) technology enables kernel bypass, allowing applications to read and write remote memory without operating system involvement. The NVM Express over Fabrics specification defines the RDMA transport binding. This transport requires specialized network interface cards (NICs) that support RoCE (RDMA over Converged Ethernet) or InfiniBand fabrics, along with lossless Ethernet configuration using Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).

NVMe over TCP (NVMe/TCP) trades some performance for infrastructure simplicity. Latencies around 100 microseconds are significantly better than iSCSI while using standard Ethernet hardware. This transport runs over conventional TCP/IP networks without requiring RDMA-capable NICs, making it accessible for organizations with existing Ethernet infrastructure. The kernel TCP stack introduces more CPU overhead than RDMA’s kernel bypass, but modern CPUs handle this efficiently at data center speeds.

NVMe over Fibre Channel (NVMe/FC) enables NVMe-oF adoption in existing Fibre Channel SANs. Organizations with significant FC infrastructure investments can migrate to NVMe without replacing their network fabric. Performance sits between RDMA and TCP, with latencies typically in the 20-50 microsecond range depending on switch architecture.

The choice between transports depends on latency requirements, existing infrastructure, and budget constraints. HPC environments demanding single-digit microsecond latency justify RDMA’s hardware investment. General enterprise workloads benefit from TCP’s simplicity. FC deployments leverage existing infrastructure while modernizing the protocol layer.

NVMe-oF vs Traditional Storage Protocols

Comparing NVMe-oF against established storage protocols reveals its performance advantages and infrastructure tradeoffs:

| Feature | NVMe-oF (RDMA) | NVMe-oF (TCP) | iSCSI | Fibre Channel |
|---|---|---|---|---|
| Protocol Layer | NVMe over RDMA fabrics | NVMe over TCP/IP | SCSI over TCP/IP | SCSI over FC |
| Typical Latency | <10 μs | ~100 μs | ~200-500 μs | ~50-100 μs |
| Network Requirements | RoCE/InfiniBand hardware | Standard Ethernet | Standard Ethernet | Dedicated FC switches |
| CPU Overhead | Very Low (kernel bypass) | Low-Medium | Medium-High | Low |
| Max Throughput | 100+ Gbps capable | 25-100 Gbps | 10-25 Gbps typical | 32 Gbps (Gen 6 FC) |
| Queue Depth | 64K queues | 64K queues | 256 commands | 254 commands |
| Best Use Case | HPC, AI/ML, low-latency databases | Cloud, general enterprise | Legacy compatibility | Mission-critical SAN |
| Hardware Investment | High (RDMA NICs) | Low (standard NICs) | Low | High (FC infrastructure) |

The queue depth difference is particularly significant. NVMe-oF’s 64,000 queue limit with 64,000 commands per queue enables massive I/O parallelism that matches modern network architectures and multi-core processors. Traditional protocols’ limited queue depths create bottlenecks when dozens of cores simultaneously issue I/O requests.
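This parallelism is tunable at connect time. With nvme-cli, an initiator can request a specific number of I/O queues and a per-queue depth (the values below are illustrative, and the target may grant fewer than requested):

```shell
# Request 16 I/O queues of depth 1024 for this connection
sudo nvme connect -t tcp -n nvme-subsys1 -a 192.168.1.100 -s 4420 \
     --nr-io-queues=16 --queue-size=1024

# Inspect the queue count the controller actually granted
# (Feature 0x07: Number of Queues; -H prints a human-readable decode)
sudo nvme get-feature /dev/nvme1 -f 0x07 -H
```

A common starting point is one I/O queue per CPU core, so each core submits and completes I/O without cross-core contention.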

CPU efficiency matters at scale. RDMA’s kernel bypass means a 64-core server can issue millions of IOPS with minimal CPU consumption, leaving computational resources for application workloads. iSCSI’s TCP/IP processing and context switches consume noticeable CPU cycles, especially at high throughput levels.

Setting Up NVMe-oF: Target Configuration

Linux systems can function as NVMe-oF targets using the nvmet kernel subsystem. The configuration process uses the kernel’s configfs interface to define subsystems, namespaces, and ports.

First, ensure the required kernel modules are available and loaded:

# Install nvme-cli and kernel modules
sudo apt-get update
sudo apt-get install nvme-cli nvmetcli

# Load NVMe target modules
sudo modprobe nvmet
sudo modprobe nvmet-rdma  # For RDMA transport
sudo modprobe nvmet-tcp   # For TCP transport

# Verify modules loaded
lsmod | grep nvmet

Create an NVMe subsystem and namespace, mapping storage to a block device:

# Create subsystem
sudo mkdir -p /sys/kernel/config/nvmet/subsystems/nvme-subsys1
cd /sys/kernel/config/nvmet/subsystems/nvme-subsys1
echo 1 | sudo tee -a attr_allow_any_host > /dev/null

# Create namespace and map to block device
sudo mkdir namespaces/1
echo /dev/nvme0n1 | sudo tee -a namespaces/1/device_path > /dev/null
echo 1 | sudo tee -a namespaces/1/enable > /dev/null

Configure a TCP transport listener on a specific IP address and port:

# Create TCP port
sudo mkdir -p /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 192.168.1.100 | sudo tee -a addr_traddr > /dev/null
echo tcp | sudo tee -a addr_trtype > /dev/null
echo 4420 | sudo tee -a addr_trsvcid > /dev/null
echo ipv4 | sudo tee -a addr_adrfam > /dev/null

# Link subsystem to port
sudo ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsys1 subsystems/

The target is now listening on port 4420 and will respond to discovery requests. For RDMA transports, the configuration is similar but specifies rdma as the transport type, with the address being an IP assigned to an RDMA-capable interface.
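For comparison, an RDMA listener on the same target could be sketched like this. The port number 2 is arbitrary, and the address must belong to an RDMA-capable (RoCE or IPoIB) interface:

```shell
# Create an RDMA port alongside the TCP port
sudo mkdir -p /sys/kernel/config/nvmet/ports/2
cd /sys/kernel/config/nvmet/ports/2
echo 192.168.1.100 | sudo tee -a addr_traddr > /dev/null
echo rdma | sudo tee -a addr_trtype > /dev/null
echo 4420 | sudo tee -a addr_trsvcid > /dev/null
echo ipv4 | sudo tee -a addr_adrfam > /dev/null

# Expose the same subsystem over RDMA as well
sudo ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsys1 subsystems/
```

The same subsystem can thus be reachable over both transports simultaneously, letting initiators choose per their hardware.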

Setting Up NVMe-oF: Initiator Connection

Client systems connecting to NVMe-oF targets need the nvme-cli userspace tools. The workflow involves discovering available targets, connecting to specific subsystems, and verifying the connection.

Discover available NVMe-oF targets on the network:

# Discover available targets
sudo nvme discover -t tcp -a 192.168.1.100 -s 4420

# Connect to specific subsystem
sudo nvme connect -t tcp -n nvme-subsys1 -a 192.168.1.100 -s 4420

# List connected NVMe devices
sudo nvme list

# Check device path (typically /dev/nvme1n1, /dev/nvme2n1, etc.)
lsblk | grep nvme

Once connected, the remote storage appears as a standard NVMe block device. Applications and file systems interact with it identically to local NVMe devices, with no modification required to existing software stacks.
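For example, the remote namespace can be formatted and mounted like any local disk. The device name /dev/nvme1n1 and mount point are illustrative; verify the actual name with nvme list first:

```shell
# Format and mount the remote namespace like a local disk
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mnt/nvmeof
sudo mount /dev/nvme1n1 /mnt/nvmeof

# Confirm the filesystem is backed by the fabric-attached device
df -h /mnt/nvmeof
```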

Verify the connection status and view detailed subsystem information:

# Check NVMe controller status
sudo nvme list-subsys

# View connection details
sudo nvme show-regs /dev/nvme1

# Test performance with fio
sudo fio --name=randread --ioengine=libaio --iodepth=32 \
         --rw=randread --bs=4k --direct=1 --size=1G \
         --numjobs=4 --runtime=60 --group_reporting \
         --filename=/dev/nvme1n1

For persistent connections that survive reboots, create systemd service units that execute the connect commands during system startup. This automation ensures remote volumes are attached before the services that depend on them start.
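A minimal sketch of such a unit, reusing the target from the earlier examples (the unit name and paths are hypothetical):

```ini
# /etc/systemd/system/nvmeof-connect.service
[Unit]
Description=Connect NVMe-oF subsystem nvme-subsys1
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nvme connect -t tcp -n nvme-subsys1 -a 192.168.1.100 -s 4420
ExecStop=/usr/sbin/nvme disconnect -n nvme-subsys1

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable nvmeof-connect.service. Many distributions also ship nvme-cli's nvmf-autoconnect mechanism, driven by entries in /etc/nvme/discovery.conf, which serves the same purpose.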

Disconnecting from targets requires explicit commands:

# Disconnect specific device
sudo nvme disconnect -d /dev/nvme1

# Or disconnect all NVMe-oF devices
sudo nvme disconnect-all

Real-World Use Cases

NVMe-oF adoption spans multiple infrastructure scenarios where low-latency shared storage provides competitive advantages.

Hyperconverged infrastructure platforms use NVMe-oF to pool storage across cluster nodes. VMware vSAN can leverage NVMe-oF to share NVMe devices between hypervisors, providing VM storage that performs similarly to local flash. Kubernetes persistent volumes backed by NVMe-oF enable container storage solutions with microsecond latencies for stateful applications.

Database clusters requiring shared storage benefit significantly. Oracle Real Application Clusters (RAC) traditionally used Fibre Channel for shared disk architectures. NVMe-oF reduces storage latency by an order of magnitude, improving transaction throughput and reducing query latencies. SQL Server Failover Cluster Instances gain similar benefits when using NVMe-oF for quorum and database storage.

AI and machine learning training workflows involve reading massive datasets repeatedly across GPU clusters. Training a large language model might read hundreds of terabytes of training data multiple times. NVMe-oF enables all GPU nodes to access centralized datasets at speeds that prevent storage from bottlenecking GPU computation, particularly when using RDMA transports that match the low latency of GPU interconnects.

Video editing and media production studios use NVMe-oF for collaborative workflows. Multiple editors can access the same high-resolution footage simultaneously without copying files locally. The high throughput and low latency support real-time 8K video playback and editing, scenarios where traditional network storage introduces noticeable lag.

Financial services firms deploy NVMe-oF for trading systems where microseconds impact profitability. Market data feeds, order execution engines, and risk calculations access shared storage with single-digit microsecond latencies. The consistency of NVMe-oF latency matters as much as the average—predictable access times enable tighter transaction timing windows.

Common Misconceptions

Several misconceptions about NVMe-oF can lead to incorrect architectural decisions or unrealistic expectations.

Misconception: NVMe-oF always delivers local NVMe performance. Reality is more nuanced. NVMe-oF over RDMA comes very close to local performance, often within 5-10 microseconds. However, NVMe-oF over TCP adds ~100 microseconds compared to local access at ~10 microseconds. Network factors—congestion, switch architecture, cable quality—introduce variability. NVMe-oF eliminates protocol overhead, but physics still governs network latency.

Misconception: Standard Ethernet can’t support NVMe-oF effectively. While RDMA requires specialized hardware and lossless Ethernet configuration, NVMe/TCP works well on standard Ethernet. A 25GbE or 100GbE network with quality switches delivers excellent performance for most workloads. TCP’s ~100 microsecond latency is still 2-5x better than iSCSI and sufficient for database, virtualization, and cloud workloads that don’t require single-digit microsecond latency.

Misconception: You need to replace all storage protocols immediately. NVMe-oF and traditional protocols coexist effectively. Many deployments use NVMe-oF for latency-sensitive tier-1 storage while keeping iSCSI or NFS for capacity-oriented tier-2 storage. Optimizing block storage performance involves matching protocols to workload requirements rather than wholesale replacement.

TBO Editorial

About the Author

TBO Editorial writes about the latest updates about products and services related to Technology, Business, Finance & Lifestyle. Do get in touch if you want to share any useful article with our community.