Is eBPF Compatible with Network Namespaces on Ubuntu 22.04?

linux-networking, ubuntu

I am experimenting with eBPF and network namespaces because, ultimately, I want to filter traffic between Kubernetes containers, but I am finding that this doesn't work. I am using the eBPF and user-space code from the test case I am working on, https://github.com/tjcw/xdp-tutorial/tree/master/ebpf-filter , and when I run https://github.com/tjcw/xdp-tutorial/blob/master/ebpf-filter/runns.sh I get the following failure:

libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libxdp: No bpffs found at /sys/fs/bpf
libxdp: Compatibility check for dispatcher program failed: No such file or directory
libxdp: Falling back to loading single prog without dispatcher
libbpf: specified path /sys/fs/bpf/accept_map is not on BPF FS
libbpf: map 'accept_map': failed to auto-pin at '/sys/fs/bpf/accept_map': -22
libbpf: map 'accept_map': failed to create: Invalid argument(-22)
libbpf: failed to load object './af_xdp_kern.o'
ERROR:xdp_program__attach returns -22

This makes it look as though eBPF is incompatible with network namespaces on the Ubuntu 22.04 kernel. Investigating the file system types shows
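A quick way to check what, if anything, is mounted at /sys/fs/bpf is findmnt, which is part of util-linux on Ubuntu 22.04:

```shell
# Print the filesystem type mounted at /sys/fs/bpf; findmnt exits
# non-zero when the path is not a mount point in its own right.
findmnt -n -o FSTYPE /sys/fs/bpf || echo "no separate mount at /sys/fs/bpf"
```

In the root namespace this is expected to print bpf; inside a shell started with 'ip netns exec' it typically reports no separate mount, which matches the df output below.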

tjcw@tjcw-Standard-PC-Q35-ICH9-2009:~/workspace/xdp-tutorial/ebpf-filter$ sudo ip netns exec ns1 bash
root@tjcw-Standard-PC-Q35-ICH9-2009:/home/tjcw/workspace/xdp-tutorial/ebpf-filter# cd /sys/fs
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs# ls
bpf  cgroup  ecryptfs  ext4  fuse  pstore
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs# df .
Filesystem     1K-blocks  Used Available Use% Mounted on
ns1                    0     0         0    - /sys
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs# cd bpf
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs/bpf# df .
Filesystem     1K-blocks  Used Available Use% Mounted on
ns1                    0     0         0    - /sys
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs/bpf# ls
root@tjcw-Standard-PC-Q35-ICH9-2009:/sys/fs/bpf#

in a network namespace, and

tjcw@tjcw-Standard-PC-Q35-ICH9-2009:~/workspace/xdp-tutorial/ebpf-filter$ sudo bash
root@tjcw-Standard-PC-Q35-ICH9-2009:/home/tjcw/workspace/xdp-tutorial/ebpf-filter# df /sys/fs
Filesystem     1K-blocks  Used Available Use% Mounted on
sysfs                  0     0         0    - /sys
root@tjcw-Standard-PC-Q35-ICH9-2009:/home/tjcw/workspace/xdp-tutorial/ebpf-filter# df /sys/fs/bpf
Filesystem     1K-blocks  Used Available Use% Mounted on
bpf                    0     0         0    - /sys/fs/bpf
root@tjcw-Standard-PC-Q35-ICH9-2009:/home/tjcw/workspace/xdp-tutorial/ebpf-filter# ls -l /sys/fs/bpf
total 0
drwx------ 2 root root 0 Nov 11 12:30 snap
drwx------ 3 root root 0 Nov 11 15:17 xdp
root@tjcw-Standard-PC-Q35-ICH9-2009:/home/tjcw/workspace/xdp-tutorial/ebpf-filter#

in the root namespace. Is eBPF really incompatible with network namespaces, or have I misconfigured or misunderstood something?
I am running in a VM, which is itself inside a VM on my laptop. The Ubuntu 22.04 kernel is 5.15.0-52-generic.

It was suggested that I should mount the bpf file system inside the namespaces, so I made the start of my testing script look like this:

#!/bin/bash -x
ip netns add ns1
ip netns exec ns1 mount -t bpf bpf /sys/fs/bpf
ip netns exec ns1 df /sys/fs/bpf

ip netns add ns2
ip netns exec ns2 mount -t bpf bpf /sys/fs/bpf
ip netns exec ns2 df /sys/fs/bpf

but this didn't work for me; I get:

+ ip netns add ns1
+ ip netns exec ns1 mount -t bpf bpf /sys/fs/bpf
+ ip netns exec ns1 df /sys/fs/bpf
Filesystem     1K-blocks  Used Available Use% Mounted on
ns1                    0     0         0    - /sys
+

It has also been suggested that I look at
https://github.com/cilium/cilium , https://isovalent.com/ , and at the tests in the Linux source tree under tools/testing/selftests/bpf for inspiration. This is what I will try next.

After trying the initial answer (below),
I get a number of Destination Host Unreachable messages. My AF_XDP program loads, but pings do not go through. Here is my test case script:

ip netns delete ns1
ip netns delete ns2
sleep 2

ip netns add ns1
ip netns add ns2

ip link add veth1 type veth peer name vpeer1
ip link add veth2 type veth peer name vpeer2

ip link set veth1 up
ip link set veth2 up

ip link set vpeer1 netns ns1
ip link set vpeer2 netns ns2

ip link add br0 type bridge
ip link set br0 up

ip link set veth1 master br0
ip link set veth2 master br0

ip addr add 10.10.0.1/16 dev br0

iptables -P FORWARD ACCEPT
iptables -F FORWARD


ip netns exec ns2 ./runns2.sh &
ip netns exec ns1 ./runns1.sh

wait
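The "RTNETLINK answers: File exists" messages in the run below come from br0 and its address surviving the 'ip netns delete' calls at the top of the script. A sketch of an idempotent prologue (assuming br0 is used only by this test):

```shell
# Remove leftovers from a previous run; the deletes are allowed to
# fail when the objects do not exist yet.
ip netns delete ns1 2>/dev/null || true
ip netns delete ns2 2>/dev/null || true
ip link delete br0  2>/dev/null || true
sleep 2
```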

with helper script runns1.sh

#!/bin/bash -x

ip netns exec ns1 ip link set lo up

ip netns exec ns1 ip link set vpeer1 up

ip netns exec ns1 ip addr add 10.10.0.10/16 dev vpeer1
sleep 6
ip netns exec ns1 ping -c 10 10.10.0.20

and helper script runns2.sh

#!/bin/bash -x

ip link set lo up
ip link set vpeer2 up
ip addr add 10.10.0.20/16 dev vpeer2
ip link set dev vpeer2 xdpgeneric off
ip tuntap add mode tun tun0
ip link set dev tun0 down
ip link set dev tun0 addr 10.10.0.30/24
ip link set dev tun0 up

mount -t bpf bpf /sys/fs/bpf
df /sys/fs/bpf
ls -l /sys/fs/bpf
rm -f /sys/fs/bpf/accept_map /sys/fs/bpf/xdp_stats_map
if [[ -z "${LEAVE}" ]]
then 
  export LD_LIBRARY_PATH=/usr/local/lib
  ./af_xdp_user -S -d vpeer2 -Q 0 --filename ./af_xdp_kern.o &
  ns2_pid=$!
  sleep 20
  kill -INT ${ns2_pid}
fi 
wait

giving this output:

+ ip netns delete ns1
+ ip netns delete ns2
+ sleep 2
+ ip netns add ns1
+ ip netns add ns2
+ ip link add veth1 type veth peer name vpeer1
+ ip link add veth2 type veth peer name vpeer2
+ ip link set veth1 up
+ ip link set veth2 up
+ ip link set vpeer1 netns ns1
+ ip link set vpeer2 netns ns2
+ ip link add br0 type bridge
RTNETLINK answers: File exists
+ ip link set br0 up
+ ip link set veth1 master br0
+ ip link set veth2 master br0
+ ip addr add 10.10.0.1/16 dev br0
RTNETLINK answers: File exists
+ iptables -P FORWARD ACCEPT
+ iptables -F FORWARD
+ ip netns exec ns1 ./runns1.sh
+ ip netns exec ns2 ./runns2.sh
+ ip netns exec ns1 ip link set lo up
+ ip link set lo up
+ ip netns exec ns1 ip link set vpeer1 up
+ ip link set vpeer2 up
+ ip addr add 10.10.0.20/16 dev vpeer2
+ ip netns exec ns1 ip addr add 10.10.0.10/16 dev vpeer1
+ ip link set dev vpeer2 xdpgeneric off
+ ip tuntap add mode tun tun0
+ sleep 6
+ ip link set dev tun0 down
+ ip link set dev tun0 addr 10.10.0.30/24
"10.10.0.30/24" is invalid lladdr.
+ ip link set dev tun0 up
+ mount -t bpf bpf /sys/fs/bpf
+ df /sys/fs/bpf
Filesystem     1K-blocks  Used Available Use% Mounted on
bpf                    0     0         0    - /sys/fs/bpf
+ ls -l /sys/fs/bpf
total 0
+ rm -f /sys/fs/bpf/accept_map /sys/fs/bpf/xdp_stats_map
+ [[ -z '' ]]
+ export LD_LIBRARY_PATH=/usr/local/lib
+ LD_LIBRARY_PATH=/usr/local/lib
+ ns2_pid=3266
+ sleep 20
+ ./af_xdp_user -S -d vpeer2 -Q 0 --filename ./af_xdp_kern.o
main cfg.filename=./af_xdp_kern.o
main Opening program file ./af_xdp_kern.o
libbpf: elf: skipping unrecognized data section(8) .xdp_run_config
libbpf: elf: skipping unrecognized data section(9) xdp_metadata
main xdp_prog=0x56161aa476b0
main bpf_object=0x56161aa44490
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
+ ip netns exec ns1 ping -c 10 10.10.0.20
xsk_socket__create_shared_named_prog returns 0
bpf_map_update_elem(9,0x7ffef63436d0,0x7ffef63436e4,0)
bpf_map_update_elem returns 0
xsk_ring_prod__reserve returns 2048, XSK_RING_PROD__DEFAULT_NUM_DESCS is 2048
tun_read thread running
tun_read

0x0000 60 00 00 00 00 08 3a ff fe 80 00 00 00 00 00 00
0x0010 4c 45 17 e6 11 7c b7 4e ff 02 00 00 00 00 00 00
0x0020 00 00 00 00 00 00 00 02 85 00 50 41 00 00 00 00
addr=0x1fff100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2049 free_count=0 frame=0x1fff100
addr=0x1ffe100 len=86 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2050 free_count=1 frame=0x1ffe100
addr=0x1ffd100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2051 free_count=2 frame=0x1ffd100
addr=0x1ffc100 len=86 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2052 free_count=3 frame=0x1ffc100
addr=0x1ffb100 len=130 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2053 free_count=4 frame=0x1ffb100
addr=0x1ffa100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2055 free_count=5 frame=0x1ffa100
addr=0x1ff9100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2055 free_count=6 frame=0x1ff9100
addr=0x1ff8100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2056 free_count=7 frame=0x1ff8100
addr=0x1ff7100 len=107 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2057 free_count=8 frame=0x1ff7100
addr=0x1ff6100 len=110 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2058 free_count=9 frame=0x1ff6100
addr=0x1ff5100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2059 free_count=10 frame=0x1ff5100
addr=0x1ff4100 len=214 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2060 free_count=11 frame=0x1ff4100
addr=0x1ff3100 len=214 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2061 free_count=12 frame=0x1ff3100
addr=0x1ff2100 len=214 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2062 free_count=13 frame=0x1ff2100
addr=0x1ff1100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2064 free_count=14 frame=0x1ff1100
addr=0x1ff0100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2064 free_count=15 frame=0x1ff0100
addr=0x1fef100 len=202 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2065 free_count=16 frame=0x1fef100
addr=0x1fee100 len=107 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2066 free_count=17 frame=0x1fee100
addr=0x1fed100 len=90 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2067 free_count=18 frame=0x1fed100
addr=0x1fec100 len=202 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2068 free_count=19 frame=0x1fec100
tun_read

0x0000 60 00 00 00 00 08 3a ff fe 80 00 00 00 00 00 00
0x0010 4c 45 17 e6 11 7c b7 4e ff 02 00 00 00 00 00 00
0x0020 00 00 00 00 00 00 00 02 85 00 50 41 00 00 00 00
addr=0x1feb100 len=107 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2069 free_count=20 frame=0x1feb100
addr=0x1fea100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2070 free_count=21 frame=0x1fea100
addr=0x1fe9100 len=202 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2071 free_count=22 frame=0x1fe9100
addr=0x1fe8100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2072 free_count=23 frame=0x1fe8100
addr=0x1fe7100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2073 free_count=24 frame=0x1fe7100
addr=0x1fe6100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2074 free_count=25 frame=0x1fe6100
addr=0x1fe5100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2075 free_count=26 frame=0x1fe5100
addr=0x1fe4100 len=107 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocaPING 10.10.0.20 (10.10.0.20) 56(84) bytes of data.
From 10.10.0.10 icmp_seq=1 Destination Host Unreachable
From 10.10.0.10 icmp_seq=2 Destination Host Unreachable
From 10.10.0.10 icmp_seq=3 Destination Host Unreachable
From 10.10.0.10 icmp_seq=4 Destination Host Unreachable
From 10.10.0.10 icmp_seq=5 Destination Host Unreachable
From 10.10.0.10 icmp_seq=6 Destination Host Unreachable
From 10.10.0.10 icmp_seq=7 Destination Host Unreachable
From 10.10.0.10 icmp_seq=8 Destination Host Unreachable
From 10.10.0.10 icmp_seq=9 Destination Host Unreachable
From 10.10.0.10 icmp_seq=10 Destination Host Unreachable

--- 10.10.0.20 ping statistics ---
10 packets transmitted, 0 received, +10 errors, 100% packet loss, time 9209ms
pipe 4
+ wait
+ kill -INT 3266
+ wait
tion_count=2076 free_count=27 frame=0x1fe4100
addr=0x1fe3100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2077 free_count=28 frame=0x1fe3100
addr=0x1fe2100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2078 free_count=29 frame=0x1fe2100
tun_read

0x0000 60 00 00 00 00 08 3a ff fe 80 00 00 00 00 00 00
0x0010 4c 45 17 e6 11 7c b7 4e ff 02 00 00 00 00 00 00
0x0020 00 00 00 00 00 00 00 02 85 00 50 41 00 00 00 00
addr=0x1fe1100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2079 free_count=30 frame=0x1fe1100
addr=0x1fe0100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2080 free_count=31 frame=0x1fe0100
addr=0x1fdf100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2082 free_count=32 frame=0x1fdf100
addr=0x1fde100 len=70 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2082 free_count=33 frame=0x1fde100
addr=0x1fdd100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2083 free_count=34 frame=0x1fdd100
addr=0x1fdc100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2084 free_count=35 frame=0x1fdc100
addr=0x1fdb100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2085 free_count=36 frame=0x1fdb100
addr=0x1fda100 len=107 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2086 free_count=37 frame=0x1fda100
addr=0x1fd9100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2087 free_count=38 frame=0x1fd9100
addr=0x1fd8100 len=42 transmitted=0
xsk_free_umem_frame xsk=0x56161aa570f0 allocation_count=2088 free_count=39 frame=0x1fd8100

I think I have seen the script run as expected once (the first packet gets lost because it is treated as a martian; subsequent packets go through), but I am unable to reproduce this behaviour, even when running immediately after a reboot.

Thanks for all the help you can give,

Best Answer

An answer from a colleague:

> Something isn't working as I expect; it looks like the bpf file system does not mount in the network namespace. The start of my script is
> #!/bin/bash -x
> ip netns add ns1
> ip netns exec ns1 mount -t bpf bpf /sys/fs/bpf
> ip netns exec ns1 df /sys/fs/bpf
>
> and I think I should expect to see 'bpf' as the filesystem type for the 'df' command. However what I actually get is
> + ip netns add ns1
> + ip netns exec ns1 mount -t bpf bpf /sys/fs/bpf
> + ip netns exec ns1 df /sys/fs/bpf
> Filesystem     1K-blocks  Used Available Use% Mounted on
> ns1                    0     0         0    - /sys
> +
> and then my attempt to run the afxdp test case process fails as before. Any idea what I am doing wrong ?

Well, the problem is that 'ip' sets up a new mount namespace every time
you do 'ip netns exec'. So the BPF mount doesn't persist across different
'exec' invocations.
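This per-invocation mount namespace is easy to demonstrate without 'ip netns' at all, using unshare (a sketch; it relies on unprivileged user namespaces, which are enabled by default on Ubuntu 22.04):

```shell
# Each 'unshare -mr' gets its own mount namespace, just as each
# 'ip netns exec' does. A mount made in the first shell...
unshare -mr sh -c 'mount -t tmpfs tmpfs /mnt && findmnt -n -o FSTYPE /mnt'
# ...is gone in the next one, because it is a different mount namespace:
unshare -mr sh -c 'findmnt -n -o FSTYPE /mnt || echo "mount is gone"'
```

The first command prints tmpfs; the second prints "mount is gone". The same thing happens to the bpffs mount between two 'ip netns exec' invocations.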

This is a bit of an impedance mismatch between libxdp and 'ip netns'.
You can get around it by having multiple commands in a single script and
executing that script with 'ip netns exec', instead of doing multiple
'exec' commands.
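Concretely, the workaround might look like this (a sketch with a hypothetical name, runns-combined.sh; it folds the mount, the per-namespace network setup, and the loader into one script run once as 'sudo ip netns exec ns1 ./runns-combined.sh'):

```shell
#!/bin/bash -x
# runns-combined.sh (hypothetical): everything that needs the bpffs
# mount runs inside ONE 'ip netns exec' invocation, so the mount is
# still present when the XDP loader starts.

# Mount bpffs only if it is not already mounted here.
mountpoint -q /sys/fs/bpf || mount -t bpf bpf /sys/fs/bpf
df /sys/fs/bpf   # should now report 'bpf' rather than the netns sysfs

ip link set lo up
ip link set vpeer1 up
ip addr add 10.10.0.10/16 dev vpeer1

export LD_LIBRARY_PATH=/usr/local/lib
./af_xdp_user -S -d vpeer1 -Q 0 --filename ./af_xdp_kern.o
```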

One thing to be aware of here is that the mount going away
also means all the pinned programs disappear; so if you load an XDP
program with libxdp, then exit the netns, and go back in, libxdp may
have trouble unloading the program. If you're running a single
application that uses AF_XDP, this shouldn't be much of an issue, though.

I guess we could also teach libxdp to try to mount the bpffs if it's not
already there...