A simple software switch with Open vSwitch
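Background
Earlier I used a Linux bridge as the software switch. It worked functionally and the performance looked acceptable, but it had several problems:
- It could not reach the NIC's line rate
- CPU usage was fairly high
- Advanced features such as RDMA were missing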
Introduction to the cx5
https://www.nvidia.com/en-us/networking/ethernet/connectx-5/
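The NVIDIA® Mellanox® ConnectX®-5 adapter provides advanced hardware offloads that reduce CPU resource consumption and deliver extremely high packet rates and throughput. This improves the efficiency of data center infrastructure and offers the highest-performance, most flexible solution for Web 2.0, cloud, data analytics, and storage platforms.
The "advanced hardware offloads that reduce CPU resource consumption" mentioned here are exactly what I need.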
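Main features for maximum efficiency:
- NVIDIA RoCE technology encapsulates packet transport over Ethernet and lowers CPU load, providing a high-bandwidth, low-latency network infrastructure for network- and storage-intensive applications.
- The breakthrough NVIDIA ASAP² technology offloads the Open vSwitch data path from the host CPU to the adapter, providing innovative SR-IOV and VirtIO acceleration for extremely high performance and scalability.
- ConnectX NICs use SR-IOV to separate access to physical resources and functions in virtualized environments. This reduces I/O overhead and lets the NIC maintain near-non-virtualized performance.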
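Key features:
- Up to 100 Gb/s Ethernet per port
- Adaptive routing on reliable transport
- NVMe over Fabric (NVMf) target offloads
- Enhanced vSwitch / vRouter offloads
- Hardware offloads for NVGRE- and VXLAN-encapsulated traffic
- End-to-end QoS and congestion control
cx5 NIC documentation: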
https://network.nvidia.com/files/doc-2020/pb-connectx-5-en-card.pdf
Hardware acceleration basics
NVIDIA DPU white paper: SR-IOV vs. VirtIO acceleration performance comparison
https://aijishu.com/a/1060000000228117
Setting up the Open vSwitch software switch
Installing Open vSwitch
sudo apt install openvswitch-switch
Check the Open vSwitch (OVS) version:
$ sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
ovs_version: "3.1.0"
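For scripting, it can be handy to extract just the version string from that output. A minimal sketch, assuming the `ovs_version: "x.y.z"` line format shown above; `parse_ovs_version` is a hypothetical helper name, and the sample text is copied from the output above rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: pull the version out of `ovs-vsctl show` output on stdin.
parse_ovs_version() {
    sed -n 's/^[[:space:]]*ovs_version:[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p'
}

# Sample input copied from the output above.
sample='cb860a93-1368-4d4f-be38-5cb9bbe5273d
    ovs_version: "3.1.0"'
printf '%s\n' "$sample" | parse_ovs_version
```

On a live host the same helper would be fed with `sudo ovs-vsctl show | parse_ovs_version`; `ovs-vsctl --version` reports the version as well.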
And the information for the openvswitch kernel module:
$ sudo modinfo openvswitch
filename: /lib/modules/6.1.0-33-amd64/kernel/net/openvswitch/openvswitch.ko
alias: net-pf-16-proto-16-family-ovs_ct_limit
alias: net-pf-16-proto-16-family-ovs_meter
alias: net-pf-16-proto-16-family-ovs_packet
alias: net-pf-16-proto-16-family-ovs_flow
alias: net-pf-16-proto-16-family-ovs_vport
alias: net-pf-16-proto-16-family-ovs_datapath
license: GPL
description: Open vSwitch switching datapath
depends: nf_conntrack,nsh,nf_nat,nf_defrag_ipv6,nf_conncount,libcrc32c
retpoline: Y
intree: Y
name: openvswitch
vermagic: 6.1.0-33-amd64 SMP preempt mod_unload modversions
sig_id: PKCS#7
signer: Debian Secure Boot CA
sig_key: 32:A0:28:7F:84:1A:03:6F:A3:93:C1:E0:65:C4:3A:E6:B2:42:26:43
sig_hashalgo: sha256
signature: 7E:3C:EA:A0:18:FE:81:6D:2C:A8:08:8A:1D:BD:D5:13:F1:5D:FE:C4:
06:2C:3B:4B:B2:4A:6D:1E:30:AD:65:CC:DB:87:73:F4:D7:D5:30:76:
D6:FF:E2:77:28:0A:AA:17:92:C4:C5:DF:EC:E8:E5:95:88:B4:62:36:
AF:BF:58:96:D0:C1:ED:A3:6D:23:18:DD:A0:CF:A6:2C:6E:71:B2:83:
AD:45:F0:59:8D:FB:6D:8C:FB:9D:80:4D:0A:16:0A:9B:CE:A3:61:60:
BC:85:9D:EE:70:4D:5A:62:6E:E3:33:C1:58:2B:C4:CE:36:27:C9:A5:
BB:6C:7D:F3:B5:74:C8:FA:C3:5F:E5:1B:28:46:55:7E:26:0E:2A:7A:
54:4B:DD:74:E8:EA:40:43:2B:62:F6:DC:13:A6:A3:C6:EA:BF:1B:41:
2B:0A:92:01:2D:57:02:EA:0A:24:C9:75:EB:F3:34:41:35:D7:31:67:
65:96:9B:3B:65:47:1B:2E:60:97:E9:C3:40:10:9F:C6:91:EB:C4:DB:
0C:D5:5D:9C:99:ED:DF:3C:CA:B3:DB:61:44:A9:A0:C5:1D:1D:C8:CF:
01:39:D6:F3:FE:81:2D:43:2C:DE:F7:A1:06:E5:EE:79:31:DC:41:83:
59:BC:30:42:BB:68:C8:27:AF:AE:69:30:51:2E:02:6A
Check that the openvswitch module is loaded correctly:
$ lsmod | grep openvswitch
openvswitch 192512 0
nsh 16384 1 openvswitch
nf_conncount 24576 1 openvswitch
nf_nat 57344 1 openvswitch
nf_conntrack 188416 3 nf_nat,openvswitch,nf_conncount
nf_defrag_ipv6 24576 2 nf_conntrack,openvswitch
libcrc32c 16384 3 nf_conntrack,nf_nat,openvswitch
Check the status of the openvswitch-switch service:
$ sudo systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; preset: enabled)
Active: active (exited) since Tue 2025-04-29 14:39:38 CST; 1min 54s ago
Process: 1816 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 1816 (code=exited, status=0/SUCCESS)
CPU: 340us
Apr 29 14:39:38 debian12 systemd[1]: Starting openvswitch-switch.service - Open vSwitch...
Apr 29 14:39:38 debian12 systemd[1]: Finished openvswitch-switch.service - Open vSwitch.
Check the ovsdb-server service:
$ sudo systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
Loaded: loaded (/lib/systemd/system/ovsdb-server.service; static)
Active: active (running) since Tue 2025-04-29 14:39:38 CST; 2min 28s ago
Process: 1718 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd --no-monito>
Main PID: 1761 (ovsdb-server)
Tasks: 1 (limit: 19089)
Memory: 2.2M
CPU: 38ms
CGroup: /system.slice/ovsdb-server.service
└─1761 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:inf>
Apr 29 14:39:38 debian12 systemd[1]: Starting ovsdb-server.service - Open vSwitch Database Unit.>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Backing up database to /etc/openvswitch/conf.db.backup8.>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Compacting database.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Converting database schema.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Starting ovsdb-server.
Apr 29 14:39:38 debian12 ovs-vsctl[1762]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- >
Apr 29 14:39:38 debian12 ovs-vsctl[1767]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set>
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Configuring Open vSwitch system IDs.
Apr 29 14:39:38 debian12 ovs-ctl[1718]: Enabling remote OVSDB managers.
Apr 29 14:39:38 debian12 systemd[1]: Started ovsdb-server.service - Open vSwitch Database Unit.
Check the ovs-vswitchd service:
$ sudo systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
Loaded: loaded (/lib/systemd/system/ovs-vswitchd.service; static)
Active: active (running) since Tue 2025-04-29 14:39:38 CST; 3min 28s ago
Process: 1771 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monito>
Main PID: 1810 (ovs-vswitchd)
Tasks: 1 (limit: 19089)
Memory: 2.1M
CPU: 17ms
CGroup: /system.slice/ovs-vswitchd.service
└─1810 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err ->
Apr 29 14:39:38 debian12 systemd[1]: Starting ovs-vswitchd.service - Open vSwitch Forwarding Uni>
Apr 29 14:39:38 debian12 ovs-ctl[1771]: Starting ovs-vswitchd.
Apr 29 14:39:38 debian12 ovs-ctl[1771]: Enabling remote OVSDB managers.
Apr 29 14:39:38 debian12 systemd[1]: Started ovs-vswitchd.service - Open vSwitch Forwarding Unit.
Creating the bridge
Before creating the bridge, look at the current network interfaces:
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global dynamic enp6s18
valid_lft 43156sec preferred_lft 43156sec
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6989sec preferred_lft 2523sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
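There are currently 3 network ports: enp6s18, enp1s0f0np0, and enp1s0f1np1. enp6s18 is a cx4 25G NIC connected to the switch; enp1s0f0np0 and enp1s0f1np1 are the two ports of a dual-port cx5 100G NIC, which will connect to two other machines.
Create the OVS bridge: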
sudo ovs-vsctl add-br br0
Running ip addr now shows ovs-system and br0:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global dynamic enp6s18
valid_lft 43027sec preferred_lft 43027sec
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6860sec preferred_lft 2394sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 46:39:3e:eb:b0:0b brd ff:ff:ff:ff:ff:ff
6: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 8a:1f:49:5c:10:4d brd ff:ff:ff:ff:ff:ff
Note: at this point the bridge's MAC address, 8a:1f:49:5c:10:4d, differs from that of all three NICs.
No ports have been configured on the bridge yet:
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
ovs_version: "3.1.0"
View the details of br0:
sudo ovs-vsctl list br br0
_uuid : a345132c-00d9-4196-bb2c-2489c3669b47
auto_attach : []
controller : []
datapath_id : "00002e1345a39641"
datapath_type : ""
datapath_version : "<unknown>"
external_ids : {}
fail_mode : []
flood_vlans : []
flow_tables : {}
ipfix : []
mcast_snooping_enable: false
mirrors : []
name : br0
netflow : []
other_config : {}
ports : [1e053d35-c79f-4b5b-88cc-f5cba9519ca0]
protocols : []
rstp_enable : false
rstp_status : {}
sflow : []
status : {}
stp_enable : false
First edit /etc/network/interfaces, backing it up beforehand to be safe:
sudo cp /etc/network/interfaces /etc/network/interfaces.bak
sudo vi /etc/network/interfaces
with the following content:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
#allow-hotplug enp6s18
#iface enp6s18 inet dhcp
# Open vSwitch Configuration
# active br0 interface
auto br0
allow-ovs br0
# config IP for interface br0
iface br0 inet static
address 192.168.3.227
netmask 255.255.255.0
gateway 192.168.3.1
ovs_type OVSBridge
ovs_ports enp6s18
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
Then add ports to the bridge, starting with enp6s18:
sudo ovs-vsctl add-port br0 enp6s18
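Once this command runs, the network connection is lost, because enp6s18 is now a bridge port while the IP address has not yet moved to br0. You can only continue from the PVE console.
After enp6s18 is added, the bridge takes on enp6s18's MAC address (visible in the ip addr output further down). The bridge now lists the new port: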
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
Port enp6s18
Interface enp6s18
ovs_version: "3.1.0"
Restart networking:
sudo systemctl restart networking
The VM's network should now be restored, and you can reconnect over ssh. Check the interfaces:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp6s18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ec brd ff:ff:ff:ff:ff:ff
4: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:5a:1f:ed brd ff:ff:ff:ff:ff:ff
9: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 46:39:3e:eb:b0:0b brd ff:ff:ff:ff:ff:ff
10: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether bc:24:11:1c:d5:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.227/24 brd 192.168.3.255 scope global br0
valid_lft forever preferred_lft forever
inet6 fdfb:ddbe:c71b:0:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft forever preferred_lft forever
inet6 240e:3a1:5055:c180:be24:11ff:fe1c:d548/64 scope global dynamic mngtmpaddr
valid_lft 6477sec preferred_lft 2874sec
inet6 fe80::be24:11ff:fe1c:d548/64 scope link
valid_lft forever preferred_lft forever
Note: br0's MAC address has now changed to enp6s18's MAC address, bc:24:11:1c:d5:48.
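The MAC comparison can also be done without reading through the full ip addr output. A minimal sketch that reads the addresses from sysfs (`/sys/class/net/<iface>/address`); `mac_of` and `same_mac` are hypothetical helper names, and the optional root argument exists only so the helpers can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: read an interface's MAC address from sysfs.
# The optional second argument overrides the sysfs root (useful for dry runs).
mac_of() {
    cat "${2:-/sys/class/net}/$1/address"
}

# Hypothetical helper: succeed when two interfaces report the same MAC.
same_mac() {
    [ "$(mac_of "$1" "$3")" = "$(mac_of "$2" "$3")" ]
}

# Runs only on a host that actually has both interfaces from this walkthrough.
if [ -d /sys/class/net/br0 ] && [ -d /sys/class/net/enp6s18 ]; then
    if same_mac br0 enp6s18; then
        echo "br0 shares its MAC with enp6s18"
    else
        echo "br0 and enp6s18 have different MACs"
    fi
fi
```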
Check whether the external network is reachable:
ping 192.168.3.1
ping www.baidu.com
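If it is reachable, the network configuration succeeded and the Open vSwitch bridge is up and working.
Adding the other ports to the bridge
The bridge currently contains only enp6s18, which only provides the uplink to the external network, so the other ports need to be added. Update /etc/network/interfaces (unchanged parts omitted):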
......
# config IP for interface br0
iface br0 inet static
address 192.168.3.227
netmask 255.255.255.0
gateway 192.168.3.1
ovs_type OVSBridge
ovs_ports enp6s18 enp1s0f0np0 enp1s0f1np1
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
# attach enp1s0f0np0 to bridge br0
allow-br0 enp1s0f0np0
iface enp1s0f0np0 inet manual
ovs_bridge br0
ovs_type OVSPort
# attach enp1s0f1np1 to bridge br0
allow-br0 enp1s0f1np1
iface enp1s0f1np1 inet manual
ovs_bridge br0
ovs_type OVSPort
Restart networking:
sudo systemctl restart networking
Then check the bridge:
sudo ovs-vsctl show
cb860a93-1368-4d4f-be38-5cb9bbe5273d
Bridge br0
Port br0
Interface br0
type: internal
Port enp1s0f0np0
Interface enp1s0f0np0
Port enp6s18
Interface enp6s18
Port enp1s0f1np1
Interface enp1s0f1np1
ovs_version: "3.1.0"
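The bridge now has three ports and can act as a switch for other machines.
On another machine, pass a cx5 NIC through to a VM and connect it with a direct cable to the dual-port cx5 NIC on the machine above. Be sure to remove the VM's vmbr NIC so that only the cx5 NIC is used.
Edit the network configuration file: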
sudo vi /etc/network/interfaces
Add the following (choose either DHCP or a static address):
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
allow-hotplug enp1s0np0
iface enp1s0np0 inet dhcp
#iface enp1s0np0 inet static
#address 192.168.3.130
#netmask 255.255.255.0
#gateway 192.168.3.1
#dns-nameservers 192.168.3.1
Restart networking:
sudo systemctl restart networking
Check that the network now connects normally.
A simple performance test
Run a quick performance test with iperf:
# run on the server
iperf -s
# run on the client
iperf -c 192.168.3.227 -i 1 -t 20 -P 4
Test results:
[SUM] 11.0000-11.7642 sec 8.37 GBytes 94.1 Gbits/sec
[SUM] 0.0000-11.7642 sec 129 GBytes 94.1 Gbits/sec
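As shown, the bandwidth reaches 94.1 Gbits/sec on both the server and the client side, close to the theoretical 100G limit.
Setting the MTU
The bridge's MTU is currently 1500; it can be raised to 9000: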
sudo ip link set br0 mtu 9000
sudo ip link set enp1s0f0np0 mtu 9000
sudo ip link set enp1s0f1np1 mtu 9000
sudo ip link set enp6s18 mtu 9000
Note: all three NICs must be set. If only enp1s0f0np0 and enp1s0f1np1 are set and enp6s18 is not, the speed does not improve.
The client's MTU must be set as well:
sudo ip link set enp1s0np0 mtu 9000
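One way to confirm that jumbo frames really pass end to end is a ping with the Don't Fragment bit set and a payload sized to fill the MTU exactly. A minimal sketch, assuming IPv4 and iputils ping (whose `-M do` option prohibits fragmentation); `icmp_payload` is a hypothetical helper name, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus the IPv4 header (20 B) and the ICMP header (8 B).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print (rather than run) the probe; -M do sets the Don't Fragment bit.
target=192.168.3.227   # br0 address from this walkthrough
echo "ping -M do -c 3 -s $(icmp_payload 9000) $target"
```

If the printed command gets replies when run on the client, every hop on the path accepts 9000-byte packets; an error about the message being too long points to an interface that is still at a smaller MTU.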
Retesting, the iperf result rose from 94.1 Gbits/sec to 98.8 Gbits/sec, a gain of about 5%:
[SUM] 0.0000-20.0004 sec 230 GBytes 98.8 Gbits/sec
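Both measurements are close to what per-packet overhead alone would predict. A back-of-the-envelope sketch, under assumptions that are mine rather than measured on this setup: IPv4, TCP with the 12-byte timestamp option, and 38 bytes of Ethernet overhead per frame on the wire (preamble 8, header 14, FCS 4, inter-frame gap 12). `max_goodput` is a hypothetical helper name.

```shell
#!/bin/sh
# Estimate the maximum TCP goodput (Gbit/s) on a link for a given MTU.
# Assumptions: IPv4 (20 B) + TCP (20 B) + TCP timestamps (12 B) inside the MTU,
# and 38 B of Ethernet overhead per frame on the wire.
max_goodput() {
    awk -v mtu="$1" -v link="$2" 'BEGIN {
        payload = mtu - 20 - 20 - 12   # TCP payload bytes per packet
        wire    = mtu + 38             # bytes occupied on the wire per packet
        printf "%.1f\n", link * payload / wire
    }'
}

max_goodput 1500 100   # standard MTU on a 100 Gbit/s link
max_goodput 9000 100   # jumbo frames on the same link
```

Under these assumptions the estimate is 94.1 Gbit/s for MTU 1500 and 99.0 Gbit/s for MTU 9000, which lines up with the measured 94.1 and 98.8 Gbits/sec.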
However, an MTU set this way reverts to 1500 after a reboot.
To make it persistent, edit the network configuration file:
sudo vi /etc/network/interfaces
On the server, set mtu 9000 on all 3 NICs:
# attach enp6s18 to bridge br0
allow-br0 enp6s18
iface enp6s18 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
# attach enp1s0f0np0 to bridge br0
allow-br0 enp1s0f0np0
iface enp1s0f0np0 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
# attach enp1s0f1np1 to bridge br0
allow-br0 enp1s0f1np1
iface enp1s0f1np1 inet manual
ovs_bridge br0
ovs_type OVSPort
mtu 9000
On the client:
allow-hotplug enp1s0np0
iface enp1s0np0 inet static
address 192.168.3.205
netmask 255.255.255.0
gateway 192.168.3.1
dns-nameservers 192.168.3.1
mtu 9000
Important: in testing, setting mtu 9000 this way takes effect with a static address, but not with DHCP, where the interface keeps using 1500. Presumably the MTU obtained via DHCP is 1500.
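Another possible explanation is that the `mtu` option in interfaces(5) is listed for the static and manual methods but not for dhcp, so ifupdown may simply ignore it there. A commonly used workaround, not verified in this walkthrough, is to force the MTU from a `post-up` hook. The sketch below only prints such a stanza; `dhcp_stanza_with_mtu` is a hypothetical helper name, and a DHCP server that hands out its own MTU could still override the value later.

```shell
#!/bin/sh
# Hypothetical helper: print an ifupdown stanza that uses DHCP and then
# forces the MTU with a post-up hook once the interface is up.
dhcp_stanza_with_mtu() {
    cat <<EOF
allow-hotplug $1
iface $1 inet dhcp
    post-up ip link set dev $1 mtu $2
EOF
}

dhcp_stanza_with_mtu enp1s0np0 9000
```

The printed stanza would replace the client's `iface enp1s0np0 inet dhcp` entry in /etc/network/interfaces.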
Restart networking, or reboot:
sudo systemctl restart networking
sudo reboot
Enabling hardware offload
sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
sudo systemctl restart networking
sudo systemctl restart openvswitch-switch
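Setting the flag does not by itself prove that flows are offloaded; that depends on the NIC, driver, and configuration. A minimal sketch of some checks, assuming the cx5 port name from this walkthrough: read the flag back, look at the `hw-tc-offload` feature reported by `ethtool -k`, and list datapath flows that OVS marks as offloaded. `tc_offload_state` is a hypothetical helper name, and the live checks are skipped unless the tools are present and the script runs as root.

```shell
#!/bin/sh
# Hypothetical helper: report the hw-tc-offload state ("on"/"off") from
# `ethtool -k <dev>` output supplied on stdin.
tc_offload_state() {
    awk '$1 == "hw-tc-offload:" { print $2 }'
}

# Live checks, run only where OVS is installed and we are root.
if command -v ovs-vsctl >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    # Should print "true" after the setting above.
    ovs-vsctl get Open_vSwitch . other_config:hw-offload
    # TC offload capability of one cx5 port (name taken from this walkthrough).
    ethtool -k enp1s0f0np0 | tc_offload_state
    # Datapath flows that are currently handled in hardware, if any.
    ovs-appctl dpctl/dump-flows type=offloaded
fi
```

An empty flow list while traffic is running suggests that forwarding is still happening in software.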
https://infoloup.no-ip.org/openvswitch-debian12-creation/
https://docs.openstack.org/neutron/latest/admin/config-ovs-offload.html
https://www.openvswitch.org/support/ovscon2019/day2/0951-hw_offload_ovs_con_19-Oz-Mellanox.pdf
https://docs.nvidia.com/doca/archive/doca-v2.2.0/switching-support/index.html