NFS RDMA support

Adding RDMA support to NFS on Debian 12.

Preparation

Prerequisites

  • The NFS server and client are already set up: see the previous chapter
  • The disks are already set up: see the previous chapter

Install the MLNX_OFED driver

For the CX-5 100G NIC, see: https://skyao.io/learning-computer-hardware/nic/cx5-huawei-sp350/driver/debian12/

The NIC driver must be installed on both the NFS server and the client.

Direct NIC connection

To avoid a Linux bridge or OVS in the path, the two servers are connected back to back with a DAC cable between their 100G NIC ports (Open vSwitch will be tried later).

Set the NIC address on the server side:

sudo vi /etc/network/interfaces

Add the following:

allow-hotplug enp1s0f1np1
# iface enp1s0f1np1 inet dhcp
iface enp1s0f1np1 inet static
address 192.168.119.1
netmask 255.255.255.0
gateway 192.168.3.1
dns-nameservers 192.168.3.1

Restart networking:

sudo systemctl restart networking

Set the NIC address on the client side:

sudo vi /etc/network/interfaces

Add the following:

allow-hotplug enp1s0np0
# iface enp1s0np0 inet dhcp
iface enp1s0np0 inet static
address 192.168.119.2
netmask 255.255.255.0
gateway 192.168.119.1
dns-nameservers 192.168.119.1

Restart networking:

sudo systemctl restart networking

Verify that the two machines can reach each other:

ping 192.168.119.1
ping 192.168.119.2

Verify the network speed between the two machines. Start the iperf server on the server side:

iperf -s -B 192.168.119.1

Run the iperf test from the client:

iperf -c 192.168.119.1 -i 1 -t 20 -P 4

The measured throughput is about 94.1 Gbits/sec, close to the theoretical maximum of a 100G NIC:

[  4] 0.0000-20.0113 sec  39.2 GBytes  16.8 Gbits/sec
[  2] 0.0000-20.0111 sec  70.1 GBytes  30.1 Gbits/sec
[  3] 0.0000-20.0113 sec  70.8 GBytes  30.4 Gbits/sec
[  1] 0.0000-20.0111 sec  39.2 GBytes  16.8 Gbits/sec
[SUM] 0.0000-20.0007 sec   219 GBytes  94.1 Gbits/sec
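
If iperf3 is preferred over the classic iperf, an equivalent test would look like the sketch below. This assumes the iperf3 package is installed on both machines; note that iperf and iperf3 are not wire-compatible, so use the same tool on both ends.

# on the server, bind to the 100G interface address
iperf3 -s -B 192.168.119.1
# on the client: 4 parallel streams for 20 seconds, 1-second reports
iperf3 -c 192.168.119.1 -P 4 -t 20 -i 1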

At this point the network link between the NFS server and the client is working, and NFS configuration can begin.

Because two NICs are now enabled (and the direct-link stanza above also declares a gateway), the default route can end up on the wrong interface and break access to the external network. Check the routing table:

$ ip route

default via 192.168.3.1 dev enp1s0f1np1 onlink 
192.168.3.0/24 dev enp6s18 proto kernel scope link src 192.168.3.227 
192.168.119.0/24 dev enp1s0f1np1 proto kernel scope link src 192.168.119.1

The default route goes via 192.168.3.1, but over enp1s0f1np1 instead of enp6s18.

Reset the default route manually:

sudo ip route del default
sudo ip route add default via 192.168.3.1 dev enp6s18
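
The ip route commands above only last until the next reboot. Since the root cause is the gateway/dns-nameservers lines in the direct-link stanza, a more permanent fix (a sketch; adjust interface names and addresses to your setup) is to drop them from the 100G interface so the default route stays on the management NIC:

allow-hotplug enp1s0f1np1
iface enp1s0f1np1 inet static
address 192.168.119.1
netmask 255.255.255.0
# no gateway / dns-nameservers here: the default route remains on enp6s18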

Install nfsrdma

sudo cp ./mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb /tmp/
cd /tmp
sudo apt-get install ./mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb

The output is:

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'mlnx-nfsrdma-dkms' instead of './mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb'
The following NEW packages will be installed:
  mlnx-nfsrdma-dkms
0 upgraded, 1 newly installed, 0 to remove and 2 not upgraded.
Need to get 0 B/71.2 kB of archives.
After this operation, 395 kB of additional disk space will be used.
Get:1 /home/sky/temp/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb mlnx-nfsrdma-dkms all 24.10.OFED.24.10.2.1.8.1-1 [71.2 kB]
Selecting previously unselected package mlnx-nfsrdma-dkms.
(Reading database ... 74820 files and directories currently installed.)
Preparing to unpack .../mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb ...
Unpacking mlnx-nfsrdma-dkms (24.10.OFED.24.10.2.1.8.1-1) ...
Setting up mlnx-nfsrdma-dkms (24.10.OFED.24.10.2.1.8.1-1) ...
Loading new mlnx-nfsrdma-24.10.OFED.24.10.2.1.8.1 DKMS files...
First Installation: checking all kernels...
Building only for 6.1.0-31-amd64
Building for architecture x86_64
Building initial module for 6.1.0-31-amd64
Done.
Forcing installation of mlnx-nfsrdma

rpcrdma.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/

svcrdma.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/

xprtrdma.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/6.1.0-31-amd64/updates/dkms/
depmod...

If the installation ends with an error like the following:

N: Download is performed unsandboxed as root as file '/home/sky/temp/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)

The trailing Permission denied warning appears because the _apt user cannot read the downloaded .deb file in its original location (here /home/sky/temp/); copying it to /tmp/ first, as done above, avoids the problem.
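
Copying to /tmp/ is the simplest workaround. An alternative sketch (assuming the file lives under /home/sky/temp/) is to make the file and its parent directories accessible to the _apt sandbox user instead:

# let _apt traverse the directories and read the .deb
sudo chmod o+x /home/sky /home/sky/temp
sudo chmod o+r /home/sky/temp/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb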

Reboot.

Check the DKMS status

Confirm that the DKMS module is registered:

sudo dkms status | grep nfsrdma

Output:

mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed

Configure the NFS server (RDMA)

To be safe, especially after a kernel upgrade, install the kernel headers and rebuild the nfsrdma modules first:

sudo apt install linux-headers-$(uname -r)
sudo dkms autoinstall
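
To double-check that modprobe will pick up the DKMS-built modules rather than the in-tree ones, the module path can be inspected; based on the DKMS output above it should point into the updates/dkms directory of the running kernel:

# expected (per the install log): /lib/modules/$(uname -r)/updates/dkms/rpcrdma.ko
modinfo -n rpcrdma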

Check that the RDMA-related modules are loaded

Run the following command to check whether the modules are loaded:

lsmod | grep -E "rpcrdma|svcrdma|xprtrdma|ib_core|mlx5_ib"

Expected output:

xprtrdma               16384  0
svcrdma                16384  0
rpcrdma                94208  0
rdma_cm               139264  2 rpcrdma,rdma_ucm
mlx5_ib               495616  0
ib_uverbs             188416  2 rdma_ucm,mlx5_ib
ib_core               462848  9 rdma_cm,ib_ipoib,rpcrdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
sunrpc                692224  18 nfsd,rpcrdma,auth_rpcgss,lockd,nfs_acl
mlx5_core            2441216  1 mlx5_ib
mlx_compat             20480  15 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,mlxfw,xprtrdma,iw_cm,svcrdma,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

If they are not, load them manually (in practice this is unnecessary; the system loads them automatically):

sudo modprobe rpcrdma
sudo modprobe svcrdma
sudo modprobe xprtrdma
sudo modprobe ib_core
sudo modprobe mlx5_ib
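
If you want to be explicit about loading the RDMA transport at boot instead of relying on autoloading, a minimal sketch using systemd's modules-load.d mechanism:

# load rpcrdma at boot (per the dmesg output below, svcrdma/xprtrdma are obsoleted aliases of it)
echo rpcrdma | sudo tee /etc/modules-load.d/nfsrdma.conf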

Confirm that the DKMS module is registered:

sudo dkms status | grep nfsrdma

Output:

mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed

Check the kernel log:

sudo dmesg | grep rdma
[  178.512334] RPC: Registered rdma transport module.
[  178.512336] RPC: Registered rdma backchannel transport module.
[  178.515613] svcrdma: svcrdma is obsoleted, loading rpcrdma instead
[  178.552178] xprtrdma: xprtrdma is obsoleted, loading rpcrdma instead

Test the RDMA connection

On the server, run:

ibstatus

In my setup only one port is connected and up, mlx5_1:

Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:1e34:daff:fe5a:1fec
	base lid:	 0x0
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 3: Disabled
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

Infiniband device 'mlx5_1' port 1 status:
	default gid:	 fe80:0000:0000:0000:1e34:daff:fe5a:1fed
	base lid:	 0x0
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

Or run ibdev2netdev, which shows:

$ ibdev2netdev

mlx5_0 port 1 ==> enp1s0f0np0 (Down)
mlx5_1 port 1 ==> enp1s0f1np1 (Up)

So start the ib_write_bw server on mlx5_1:

ib_write_bw -d mlx5_1 -p 18515

It shows:

************************************
* Waiting for client to connect... *
************************************

On the client, run:

ibstatus

It shows:

Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:8e2a:8eff:fe88:a136
	base lid:	 0x0
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

So run the ib_write_bw test from mlx5_0 against the server:

ib_write_bw -d mlx5_0 -p 18515 192.168.119.1

The client reports:

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON		Lock-free      : OFF
 ibv_wr* API     : ON		Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x009b PSN 0x57bb8f RKey 0x1fd4bc VAddr 0x007f0e0c8aa000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:02
 remote address: LID 0000 QPN 0x0146 PSN 0x860b7f RKey 0x23c6bc VAddr 0x007fb25aea4000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:01
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             11030.78            11030.45		     0.176487
---------------------------------------------------------------------------------------

The server reports:

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_1
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON		Lock-free      : OFF
 ibv_wr* API     : ON		Using DDP      : OFF
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0146 PSN 0x860b7f RKey 0x23c6bc VAddr 0x007fb25aea4000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:01
 remote address: LID 0000 QPN 0x009b PSN 0x57bb8f RKey 0x1fd4bc VAddr 0x007f0e0c8aa000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:119:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             11030.78            11030.45		     0.176487
---------------------------------------------------------------------------------------

The test succeeds, which means the RDMA network is ready.

Enable RDMA support on the NFS server

The NFS server needs RDMA support enabled:

sudo vi /etc/nfs.conf

Modify the [nfsd] section of /etc/nfs.conf (note: also uncomment the version lines and restrict the server to NFS 4.2 only):

[nfsd]
# debug=0
# raise the thread count; the default of 8 is too low
threads=128
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=n
# tcp=y
# these version lines must be set; keep only 4.2 enabled
vers3=n
vers4=y
vers4.0=n
vers4.1=n
vers4.2=y
rdma=y
rdma-port=20049
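
Before restarting, the merged configuration can be sanity-checked with nfsconf from nfs-utils, if it is available on your install; the effective [nfsd] settings should show the values just edited:

# print the effective [nfsd] settings as the NFS tools see them
sudo nfsconf --dump | grep -A 20 '\[nfsd\]'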

Then restart the NFS service:

sudo systemctl restart nfs-server

Then check nfsd's listening ports to verify that RDMA is in effect:

sudo cat /proc/fs/nfsd/portlist                    
rdma 20049
rdma 20049
tcp 2049
tcp 2049

If the rdma 20049 port shows up, RDMA is configured correctly. If it does not, the RDMA configuration failed and the previous steps need to be rechecked.
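
If rdma 20049 does not show up even though the modules are loaded, the RDMA listener can also be added by hand through the nfsd portlist interface (a sketch of a common fallback; it does not persist across an nfs-server restart):

# ask the running nfsd to open an RDMA listener on port 20049
echo "rdma 20049" | sudo tee -a /proc/fs/nfsd/portlist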

Configure the NFS client (RDMA)

Check the RDMA modules

Make sure the RDMA-related modules are loaded on the client as well:

lsmod | grep -E "rpcrdma|svcrdma|xprtrdma|ib_core|mlx5_ib"

If they are not, load them manually (in practice this is unnecessary; the system loads them automatically):

sudo modprobe rpcrdma
sudo modprobe svcrdma
sudo modprobe xprtrdma
sudo modprobe ib_core
sudo modprobe mlx5_ib

Confirm that the DKMS module is registered:

sudo dkms status | grep nfsrdma

Expected output:

mlnx-nfsrdma/24.10.OFED.24.10.2.1.8.1, 6.1.0-31-amd64, x86_64: installed

Check the kernel log:

sudo dmesg | grep rdma

Expected output:

[ 3273.613081] RPC: Registered rdma transport module.
[ 3273.613082] RPC: Registered rdma backchannel transport module.
[ 3695.887144] svcrdma: svcrdma is obsoleted, loading rpcrdma instead
[ 3695.923962] xprtrdma: xprtrdma is obsoleted, loading rpcrdma instead

Configure NFS access (RDMA)

Prepare the mount point:

cd /mnt
sudo mkdir -p nfs-rdma

Mount the NFS export over RDMA:

sudo mount -t nfs -o rdma,port=20049,vers=4.2,nconnect=16 192.168.119.1:/exports/shared /mnt/nfs-rdma
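
To confirm the mount actually negotiated RDMA rather than falling back to TCP, check the mount options in use; proto=rdma and port=20049 should appear (a quick check, either command will do):

# show the active NFS mount options; look for proto=rdma,port=20049
nfsstat -m
grep nfs-rdma /proc/mounts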

After the mount succeeds, test the read and write throughput:

# write 100 GB over NFS; throughput is roughly 1.0 GB/s
sudo dd if=/dev/zero of=/mnt/nfs-rdma/test-100g.img bs=1G count=100 oflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 104.569 s, 1.0 GB/s
# read the 100 GB back over NFS; throughput is roughly 2.2 GB/s
sudo dd if=/mnt/nfs-rdma/test-100g.img of=/dev/null bs=1G count=100 iflag=dsync
107374182400 bytes (107 GB, 100 GiB) copied, 47.4093 s, 2.2 GB/s
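
To make the RDMA mount survive reboots, a matching /etc/fstab entry might look like this (a sketch; _netdev delays the mount until the network is up):

192.168.119.1:/exports/shared  /mnt/nfs-rdma  nfs  rdma,port=20049,vers=4.2,nconnect=16,_netdev  0  0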

For comparison, here are the three setups with the same Kioxia CD6 3.84 TB SSD passed through to the VM:

Scenario                100 GB single-file write    100 GB single-file read
Direct disk access      1.3 GB/s                    4.0 GB/s
NFS mount (non-RDMA)    1.0 GB/s                    2.5 GB/s
NFS mount (RDMA)        1.0 GB/s                    2.2 GB/s
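
Note that dd is a single-threaded sequential test, so it cannot take full advantage of nconnect=16. For a more thorough benchmark, something like the fio sketch below (assuming the fio package is installed; adjust paths and sizes) drives several parallel streams with direct I/O against the RDMA mount:

# 4 parallel sequential writers, 1 MiB blocks, direct I/O, aggregated report
sudo fio --name=nfs-rdma-write --directory=/mnt/nfs-rdma --rw=write --bs=1M \
  --size=10G --numjobs=4 --direct=1 --ioengine=libaio --group_reporting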