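Mellanox / Huawei SP350 100G NIC

Huawei SP350 / Mellanox CX5 100G single-port NIC

1 - Flashing firmware on the Huawei SP350

Flashing new firmware onto the Huawei SP350 NIC

So far the Huawei SP350 cannot be flashed with the official firmware, so I have given up on this for now.

Appendix

Installing MFT on Debian 12

Installing MFT on Debian 12 follows the cx4121a section; the procedure is similar.

However, for reasons I have not figured out, mst fails to start after installation: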

$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI modulemodprobe: ERROR: could not insert 'mst_pci': Key was rejected by service
 - Failure: 1
Loading MST PCI configuration modulemodprobe: ERROR: could not insert 'mst_pciconf': Key was rejected by service
 - Failure: 1
Create devices
mst_pci driver not found
Unloading MST PCI module (unused) - Success
Unloading MST PCI configuration module (unused) - Success

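I tested this repeatedly:

  • mst fails to start whether I use the driver that ships with Debian 12 or install the latest official driver.

  • I tried both mft-4.26 and mft-4.27, with the same result.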
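"Key was rejected by service" is the error the kernel returns when it refuses a module whose signature it does not trust, which usually points at Secure Boot rather than at the MFT version. The sketch below is one way to test that hypothesis. It assumes the mokutil package may or may not be installed, and the helper name is_signature_rejection is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a modprobe error line looks like a
# module-signature rejection.
is_signature_rejection() {
    case "$1" in
        *"Key was rejected by service"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the Secure Boot state when mokutil is available (assumption: it
# may not be installed on a minimal Debian system).
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state || true
else
    echo "mokutil not found; install it to query the Secure Boot state"
fi
```

If Secure Boot turns out to be enabled, that would be consistent with the kernel rejecting the unsigned mst_pci and mst_pciconf modules.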

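Removing the ROM under Windows 10

To make boot a little faster, remove the NIC's ROM under Windows. The method is generic (use the device name that mst status reports for your card):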

flint -d /dev/mst/mt4117_pciconf0 --allow_rom_change drom

The process is slow:

-I- Preparing to remove ROM ...
Removing ROM image - OK    # this step takes 1-2 minutes
Restoring signature - OK

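2 - cx4121a driver

Installing drivers for the Mellanox MCX4121A NIC

2.1 - Installing the driver on Debian 12

Installing the Huawei SP350 NIC driver on Debian 12

Update 2024-09-09: Debian 12 has been upgraded to 12.5, and the NIC driver is now the latest release, 24.07-0.6.1.0.

Preparation

Checking the default driver

This is the default driver that ships with Debian 12: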

$ lsmod | grep mlx
mlx5_ib               405504  0
ib_uverbs             172032  1 mlx5_ib
ib_core               438272  2 ib_uverbs,mlx5_ib
mlx5_core            1683456  1 mlx5_ib
mlxfw                  36864  1 mlx5_core
psample                20480  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core

Details of mlx5_core:

$ modinfo mlx5_core
filename:       /lib/modules/6.1.0-20-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) core driver
author:         Eli Cohen <eli@mellanox.com>
alias:          auxiliary:mlx5_core.eth
alias:          pci:v000015B3d0000A2DFsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2DCsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D6sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D3sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D2sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001023sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001021sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Fsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Esv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Dsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Csv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Bsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Asv*sd*bc*sc*i*
alias:          pci:v000015B3d00001019sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001018sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001017sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001016sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001015sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001014sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001013sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001012sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001011sv*sd*bc*sc*i*
alias:          auxiliary:mlx5_core.eth-rep
depends:        psample,pci-hyperv-intf,mlxfw
retpoline:      Y
intree:         Y
name:           mlx5_core
vermagic:       6.1.0-20-amd64 SMP preempt mod_unload modversions
sig_id:         PKCS#7
signer:         Debian Secure Boot CA
sig_key:        32:A0:28:7F:84:1A:03:6F:A3:93:C1:E0:65:C4:3A:E6:B2:42:26:43
sig_hashalgo:   sha256
signature:      86:53:46:C0:77:7E:22:E0:2A:B3:23:32:E3:87:DA:7C:94:3A:B1:1B:
                5A:92:14:41:17:78:2B:25:A9:9E:B9:9E:0C:F7:1C:2E:30:F3:D3:96:
                44:27:A8:74:A3:7D:2F:83:7D:2B:F4:A7:4E:C5:00:98:0B:56:15:0C:
                DF:53:B8:01:66:B2:C0:9D:C9:DD:2C:E3:A6:BA:91:E0:B0:11:37:DF:
                D7:32:B9:DA:B4:B5:B8:FB:CA:8F:21:46:91:05:28:C1:F1:D9:1B:C5:
                C7:B4:67:58:D9:29:B2:43:84:A0:5F:AD:01:E8:41:71:18:08:18:83:
                0E:F3:E7:88:32:08:46:3B:42:AF:A9:8F:63:E4:45:5D:45:16:E8:48:
                84:67:02:C1:A1:AF:A3:71:35:4C:E5:12:83:4D:05:BD:BE:14:01:F6:
                E5:19:E2:3A:60:9D:0A:D1:C6:B7:E6:CE:FE:8C:7C:0F:B5:01:49:08:
                D9:BB:CE:16:4C:5D:18:CC:61:ED:D3:D4:CA:2E:44:A0:4A:2B:59:DC:
                2B:30:06:27:8E:25:7E:0D:4B:00:7B:4E:2A:7F:65:87:22:B0:1A:BC:
                75:C5:83:47:21:92:D9:84:F5:FC:89:5B:3F:5A:9F:6E:16:FC:38:C4:
                2F:5C:9C:BF:7A:AB:F3:91:32:C6:CA:05:50:5C:27:10
parm:           debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)
parm:           prof_sel:profile selector. Valid range 0 - 2 (uint)

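The default in-tree driver carries no version field of its own; the only match is vermagic, which gives the kernel version, 6.1.0-20: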

$ modinfo mlx5_core | grep version
vermagic:       6.1.0-20-amd64 SMP preempt mod_unload modversions
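Because grep version also matches vermagic and srcversion, it can help to pull out one field by its exact name. A minimal sketch, assuming modinfo output is piped in on stdin; the helper name modinfo_field is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the value of one modinfo field read from stdin.
# The field name is matched exactly, so "version" does not also match
# "srcversion" or "vermagic".
modinfo_field() {
    awk -v key="$1:" '$1 == key { $1 = ""; sub(/^ +/, ""); print }'
}

# Example (needs the module to be present on the machine):
#   modinfo mlx5_core | modinfo_field version
```

On the in-tree driver this prints nothing, since there is no version field. modinfo -F version mlx5_core gives the same answer without a helper.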

Downloading the driver

Download page:

https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

Select the matching Debian version. The latest release, 24.07-0.6.1.0, already supports Debian 12.5:

(screenshot: mlnx_ofed_download)

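The download is MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64.tgz; copy it to the Debian 12 machine with scp.

Disabling Secure Boot

Secure Boot must be disabled in the BIOS of the physical or virtual machine, because it conflicts with the latest mlnx_ofed driver, 24.07-0.6.1.0.

In a PVE virtual machine it looks like this: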

(screenshot: disable-secure-boot)

Otherwise the latest driver reports errors after installation and the NIC becomes unusable.

Installing the driver

su root
tar -xzf MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64.tgz
cd MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64

Set PATH first; with the default PATH the installer cannot find some essential commands and fails:

export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
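The likely reason is that a plain su root (without -) keeps the calling user's PATH, which on Debian normally lacks the sbin directories. A small sketch to see which of them are missing before running the installer; the helper name path_has_dir is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $2 is an entry of the
# colon-separated list $1.
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which of the sbin directories are missing from the current PATH.
for dir in /usr/local/sbin /usr/sbin /sbin; do
    path_has_dir "$PATH" "$dir" || echo "missing from PATH: $dir"
done
```

If nothing is printed, the export above is already in effect (or the shell was started with su -).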

Configure a proxy to speed up downloads:

export all_proxy=socks5://192.168.0.1:7891;export http_proxy=http://192.168.0.1:7890;export https_proxy=http://192.168.0.1:7890;export no_proxy=127.0.0.1,localhost,local,.local,.lan,192.168.0.0/16,10.0.0.0/16

Start the installation:

./mlnxofedinstall --without-fw-update --with-nvmf --with-nfsrdma --ovs-dpdk

Note that some driver versions need --distro debian12.1, otherwise the installer may fail with:

Error: The current MLNX_OFED_LINUX is intended for debian12.1

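This happens because my Debian 12 was already at 12.5 when I installed it, while those driver packages target 12.1. The latest 24.07-0.6.1.0 driver is packaged for Debian 12.5: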

$ ./mlnxofedinstall --print-distro
debian12.5
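To spot the mismatch before running the installer, the distro tag in the tarball name can be compared with the host's /etc/debian_version. This is only a sketch: the helper names distro_tag and tarball_distro are hypothetical, and it assumes /etc/debian_version holds a plain number such as 12.5, which is not the case on testing or unstable.

```shell
#!/bin/sh
# Hypothetical helper: turn the contents of /etc/debian_version (e.g. "12.5")
# into the tag format used in the tarball name (e.g. "debian12.5").
distro_tag() {
    printf 'debian%s\n' "$1"
}

# Hypothetical helper: extract the distro tag from an MLNX_OFED tarball name.
tarball_distro() {
    printf '%s\n' "$1" |
        sed -n 's/^MLNX_OFED_LINUX-.*-\(debian[0-9.]*\)-.*\.tgz$/\1/p'
}

tarball=MLNX_OFED_LINUX-24.07-0.6.1.0-debian12.5-x86_64.tgz
if [ -r /etc/debian_version ]; then
    host=$(distro_tag "$(cat /etc/debian_version)")
    pkg=$(tarball_distro "$tarball")
    if [ "$host" != "$pkg" ]; then
        echo "host is $host but the tarball targets $pkg; --distro $pkg may be needed"
    fi
fi
```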

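The three flags --with-nvmf --with-nfsrdma --ovs-dpdk are optional; I added them mainly because I want to learn and test those features.

The installation runs as follows (the log is still from 24.01-0.3.3.1; 24.07-0.6.1.0 looks similar):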

$ ./mlnxofedinstall --without-fw-update --with-nvmf --with-nfsrdma --ovs-dpdk
Logs dir: /tmp/MLNX_OFED_LINUX.1071.logs
General log file: /tmp/MLNX_OFED_LINUX.1071.logs/general.log
Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):
ofed-scripts
mlnx-tools
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-dkms
iser-dkms
isert-dkms
srp-dkms
mlnx-nvme-dkms
rdma-core
libibverbs1
ibverbs-utils
ibverbs-providers
libibverbs-dev
libibverbs1-dbg
libibumad3
libibumad-dev
ibacm
librdmacm1
rdmacm-utils
librdmacm-dev
mstflint
ibdump
libibmad5
libibmad-dev
libopensm
opensm
opensm-doc
libopensm-devel
libibnetdisc5
infiniband-diags
mft
kernel-mft-dkms
perftest
ibutils2
ibsim
ibsim-doc
ucx
sharp
hcoll
knem-dkms
knem
openmpi
mpitests
srptools
mlnx-ethtool
mlnx-iproute2
rshim
ibarr
libopenvswitch
openvswitch-common
openvswitch-switch
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.
Do you want to continue?[y/N]:y
Checking SW Requirements...
One or more required packages for installing MLNX_OFED_LINUX are missing.
Attempting to install the following missing packages:
libipsec-mb1 uuid-runtime libunwind8 libunbound8 libpcap0.8
Removing old packages...
Uninstalling the previous version of MLNX_OFED_LINUX
Installing new packages
Installing ofed-scripts-24.01.OFED.24.01.0.3.3...
Installing mlnx-tools-24.01.0...
Installing mlnx-ofed-kernel-utils-24.01.OFED.24.01.0.3.3.1...
Installing mlnx-ofed-kernel-dkms-24.01.OFED.24.01.0.3.3.1...
Installing iser-dkms-24.01.OFED.24.01.0.3.3.1...
Installing isert-dkms-24.01.OFED.24.01.0.3.3.1...
Installing srp-dkms-24.01.OFED.24.01.0.3.3.1...
Installing mlnx-nvme-dkms-24.01.OFED.24.01.0.3.3.1...
Installing rdma-core-2307mlnx47...
Installing libibverbs1-2307mlnx47...
Installing ibverbs-utils-2307mlnx47...
Installing ibverbs-providers-2307mlnx47...
Installing libibverbs-dev-2307mlnx47...
Installing libibverbs1-dbg-2307mlnx47...
Installing libibumad3-2307mlnx47...
Installing libibumad-dev-2307mlnx47...
Installing ibacm-2307mlnx47...
Installing librdmacm1-2307mlnx47...
Installing rdmacm-utils-2307mlnx47...
Installing librdmacm-dev-2307mlnx47...
Installing mstflint-4.16.1...
Installing ibdump-6.0.0...
Installing libibmad5-2307mlnx47...
Installing libibmad-dev-2307mlnx47...
Installing libopensm-5.18.0.MLNX20240128.3f266a48...
Installing opensm-5.18.0.MLNX20240128.3f266a48...
Installing opensm-doc-5.18.0.MLNX20240128.3f266a48...
Installing libopensm-devel-5.18.0.MLNX20240128.3f266a48...
Installing libibnetdisc5-2307mlnx47...
Installing infiniband-diags-2307mlnx47...
Installing mft-4.27.0...
Installing kernel-mft-dkms-4.27.0.83...
Installing perftest-24.01.0...
Installing ibutils2-2.1.1...
Installing ibsim-0.12...
Installing ibsim-doc-0.12...
Installing ucx-1.16.0...
Installing sharp-3.6.0.MLNX20240128.e669b4e8...
Installing hcoll-4.8.3227...
Installing knem-dkms-1.1.4.90mlnx3...
Installing knem-1.1.4.90mlnx3...
Installing openmpi-4.1.7a1...
Installing mpitests-3.2.22...
Installing srptools-2307mlnx47...
Installing mlnx-ethtool-6.4...
Installing mlnx-iproute2-6.4.0...
Installing rshim-2.0.19...
Installing ibarr-0.1.3...
Installing libopenvswitch-2.17.8...
Installing openvswitch-common-2.17.8...
Installing openvswitch-switch-2.17.8...
Selecting previously unselected package mlnx-fw-updater.
(Reading database ... 101192 files and directories currently installed.)
Preparing to unpack .../mlnx-fw-updater_24.01-0.3.3.1_amd64.deb ...
Unpacking mlnx-fw-updater (24.01-0.3.3.1) ...
Setting up mlnx-fw-updater (24.01-0.3.3.1) ...
Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf
Skipping FW update.
Device (01:00.0):
    01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    Link Width: x16
    PCI Link Speed: 8GT/s
Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
Note: In order to load the new nvme-rdma and nvmet-rdma modules, the nvme module must be reloaded.

After a reboot, 24.01-0.3.3.1 and earlier versions work normally.

Error: pci_hp_register failed

The latest 24.07-0.6.1.0 release, however, fails: ip addr shows that the cx5 NIC has disappeared.

dmesg shows this error:

pci_hp_register failed with error -16

If you upgrade the Linux kernel, the upgrade prompts "Your system has UEFI Secure Boot enabled".

Following that clue, I disabled Secure Boot in the virtual machine's BIOS.

After a reboot everything worked.
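To confirm the symptom before and after changing the Secure Boot setting, the kernel log can be scanned for the message quoted above. A minimal sketch; the helper name has_hp_register_error is hypothetical, and reading dmesg usually needs root:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and succeed if it
# contains the pci_hp_register failure.
has_hp_register_error() {
    grep -q 'pci_hp_register failed with error'
}

# Example (run as root):
#   dmesg | has_hp_register_error && echo "pci_hp_register failure present"
#   ip -br link    # check whether the cx5 interface is listed again
```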

Post-install steps

Disabling openibd autostart

After the installation finishes and before rebooting, stop openibd from starting at boot:

sudo systemctl disable openibd

Output:

Synchronizing state of openibd.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable openibd
Removed "/etc/systemd/system/sysinit.target.wants/openibd.service".

For now I only use Ethernet mode anyway, not IB mode.


Checking the driver after installation

$ lsmod | grep mlx
mlx5_ib               479232  0
ib_uverbs             184320  2 rdma_ucm,mlx5_ib
ib_core               454656  8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core            2420736  1 mlx5_ib
mlxfw                  36864  1 mlx5_core
psample                20480  1 mlx5_core
mlxdevm               180224  1 mlx5_core
mlx_compat             20480  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
tls                   135168  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core

Details of mlx5_core:

$ modinfo mlx5_core
filename:       /lib/modules/6.1.0-25-amd64/updates/dkms/mlx5_core.ko
alias:          auxiliary:mlx5_core.eth-rep
alias:          auxiliary:mlx5_core.eth
basedon:        Korg 6.8-rc4
version:        24.07-0.6.1
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) core driver
author:         Eli Cohen <eli@mellanox.com>
srcversion:     769E8732BF9FAF2E580D2BC
alias:          pci:v000015B3d0000A2DFsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2DCsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D6sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D3sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000A2D2sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001023sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001021sv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Fsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Esv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Dsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Csv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Bsv*sd*bc*sc*i*
alias:          pci:v000015B3d0000101Asv*sd*bc*sc*i*
alias:          pci:v000015B3d00001019sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001018sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001017sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001016sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001015sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001014sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001013sv*sd*bc*sc*i*
alias:          auxiliary:mlx5_core.sf
depends:        mlxdevm,mlx_compat,tls,pci-hyperv-intf,psample,mlxfw
retpoline:      Y
name:           mlx5_core
vermagic:       6.1.0-25-amd64 SMP preempt mod_unload modversions
sig_id:         PKCS#7
signer:         DKMS module signing key
sig_key:        25:DA:47:F2:9F:35:E2:08:53:6F:AD:D7:4E:06:E8:E0:59:C8:1E:89
sig_hashalgo:   sha256
signature:      01:97:E6:8D:53:AD:D9:38:E0:D5:8C:00:B9:8F:EB:C6:2E:5F:DF:7F:
                C5:DB:AA:62:85:81:36:F1:8E:E3:82:2E:33:63:9B:E6:57:07:2D:DC:
                43:51:C4:04:15:AA:C9:B7:A1:02:58:1F:74:EE:2A:27:91:B4:A2:23:
                FE:25:31:06:62:1D:D0:2D:A6:55:C5:B2:CB:A4:25:0B:DA:24:18:81:
                0E:E3:7A:76:EC:5A:C3:E0:A7:E5:75:44:4C:BD:3C:E1:AD:55:EA:F1:
                6A:E7:B4:7A:03:A6:DD:32:10:5B:A4:A0:74:EC:02:E0:D1:33:65:E2:
                17:4C:16:01:54:5D:60:C5:AF:0E:4C:4A:73:4B:FB:C8:BB:0A:00:AB:
                80:05:82:E2:9A:72:58:F6:0A:18:21:E2:3E:57:91:9A:2D:31:DC:04:
                55:A0:3E:B2:62:7D:F4:F1:9A:8C:B6:9F:88:27:A3:92:07:14:57:28:
                D4:61:4C:B2:EE:70:A4:DF:90:C9:F3:0C:85:43:8F:C2:C0:C1:75:77:
                E6:76:CD:26:B6:6D:F7:13:10:B0:EC:CA:9F:B8:31:3E:C3:A3:FA:ED:
                3E:CB:55:D6:7D:0E:6A:32:66:1E:C0:95:E1:00:F3:47:DA:20:0D:1E:
                68:DF:1F:4E:4C:99:97:D6:55:48:2B:65:E6:47:1A:35
parm:           num_of_groups:Eswitch offloads number of big groups in FDB table. Valid range 1 - 1024. Default 15 (uint)
parm:           debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)
parm:           prof_sel:profile selector. Valid range 0 - 4 (uint)

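After the update, mlx5_core moves from the in-tree default to the OFED version (the output below was captured with the earlier 24.01-0.3.3 install):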

$ modinfo mlx5_core | grep version
version:        24.01-0.3.3
srcversion:     59290B9C495B89FC195B001
vermagic:       6.1.0-20-amd64 SMP preempt mod_unload modversions
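Besides the version field, the module path also shows which driver is loaded: the in-tree module lives under kernel/drivers, while the OFED build sits under updates/dkms, as the two modinfo listings show. A small sketch that classifies the path; the helper name driver_origin is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: classify a module by the filename modinfo reports.
driver_origin() {
    case "$1" in
        */updates/dkms/*) echo "dkms" ;;
        */kernel/drivers/*) echo "in-tree" ;;
        *) echo "unknown" ;;
    esac
}

# Example (needs the module to be present on the machine):
#   driver_origin "$(modinfo -F filename mlx5_core)"
```

This is handy after a kernel upgrade, to check that DKMS rebuilt the module for the new kernel instead of the in-tree driver being picked up again.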

2.2 - Windows driver

Installing the Huawei SP350 NIC driver on Windows

The installation is the same as for the cx4121a NIC.