XDP(eXpress Data Path)


Author
JaeHyuk Cho (minzkn@minzkn.com)
Revision history
2022-01-17 : First written
2022-02-04 : Added several overview figures
2022-09-14 : Added a figure of the packet flow

1. XDP(eXpress Data Path)

1.1. Overview

1.2. Setting up the development environment

1.3. BPF : cBPF (Classic BPF) / eBPF (Extended Berkeley Packet Filter)

1.3.1. BPF Architecture
1.3.2. BPF instruction set

1.4. AF_XDP socket (XSK)

1.5. Summary of the XDP and xfrm (IPsec) flow in Linux Kernel v5.x

1.6. References

1.1. Overview

[PNG image (66.7 KB)]

XDP (eXpress Data Path) is an eBPF-based, high-performance, programmable network data path that was merged in Linux Kernel v4.8 (the new address family AF_XDP was merged in Linux Kernel v4.18).

To speed up networking on systems running the Linux kernel, many people have investigated the sources of overhead (zero copy, allocation and context overhead, ...) and devised approaches such as netmap, DPDK (Data Plane Development Kit), VPP (Vector Packet Processing), PF_RING, Maglev, Onload, Snabb, and others.

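One of these approaches, DPDK (Data Plane Development Kit), takes networking out of the kernel and handles it in user space. This achieves the speed goal to a fair degree, but the large amount of network processing that the kernel used to do has to be reimplemented in user space, which makes it troublesome to fit every environment.

XDP (eXpress Data Path), the subject of this document, has now emerged as an alternative to user-space networking and is establishing itself as a way to balance the benefits of the kernel against faster packet processing. (That said, XDP by itself should not be understood as a universal solution that makes all packet processing fast. It offers a way to pass bulk packet frames through the kernel with as little overhead as possible, and putting that to use for a real purpose is up to the user who builds on it.)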

The key point is that, in the kernel, an XDP program can run even before a "struct sk_buff" is created. This is because an XDP program is attached to the NIC device with a command such as "ip link set dev ... {xdp|xdpgeneric|xdpdrv|xdpoffload|...} ...". If the device driver implements XDP support, the program runs at the very start of the receive routine (for the i40e driver, "i40e_run_xdp(rx_ring, &xdp)", the earliest possible point right after the DMA sync inside the driver). Otherwise it runs somewhat higher up (in netif_receive_skb(), via "do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);"). (In addition, BPF programs can also run from tc: see the bpf_prog_run call sites under "net/sched/".)

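This document was written to organize the material I collected while personally investigating XDP (eXpress Data Path). Please keep in mind that parts of it reflect my own interpretation and may be wrong. If you find anything that should be corrected, please send me feedback so that we can write this document together. I would also like to thank those who created XDP, studied it and left material behind before me, and everyone who provides feedback.

The document I found most useful while studying is the following, and for the most part it was enough on its own to understand the topic. All other documents I referred to are linked at the bottom of this document.

1.2. Setting up the development environment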

[PNG image (82.16 KB)]

Basically, everything starts with using libbpf to build a BPF program and load it into the kernel.

A BPF program is written in restricted C and compiled with LLVM + clang into an ELF object file that stores BPF bytecode. It is then loaded into the kernel through the bpf system call (syscall) and executed there. When needed, the perf utility can be used to trace what the kernel is doing.

On the major distributions, the LLVM + Clang (+ libelf + libpcap + ...) environment can be installed from packages as shown below. Depending on conditions such as whether your target runtime environment is the same as your development environment, you may also install the kernel headers, bpftool, perf, and so on. (Reference: https://github.com/xdp-project/xdp-tutorial/blob/master/setup_dependencies.org)

On Fedora
```
$ sudo dnf install clang llvm
$ sudo dnf install elfutils-libelf-devel libpcap-devel
```
- Kernel headers (optional; if the target kernel differs, use that kernel's headers)
```
$ sudo dnf install kernel-headers
```
- bpftool (optional)
```
$ sudo dnf install bpftool
```
- perf utility (optional)
```
$ sudo dnf install perf
```

On Debian / Ubuntu

$ sudo apt install clang llvm
$ sudo apt install libelf-dev libpcap-dev gcc-multilib build-essential

Kernel headers (optional; if the target kernel differs, use that kernel's headers)
```
$ sudo apt install linux-headers-$(uname -r)
```

bpftool (optional)

On Ubuntu

$ sudo apt install linux-tools-common linux-tools-generic

perf utility (optional)
- On Debian
```
$ sudo apt install linux-perf
```
- On Ubuntu
```
$ sudo apt install linux-tools-$(uname -r)
```

On openSUSE
```
$ sudo zypper install clang llvm
$ sudo zypper install libelf-devel libpcap-devel
```
- Kernel headers (optional; if the target kernel differs, use that kernel's headers)
```
$ sudo zypper install kernel-devel
```
- bpftool (optional)
```
$ sudo zypper install bpftool
```
- perf utility (optional)
```
$ sudo zypper install perf
```

On Ubuntu

$ sudo apt install autoconf automake m4
$ sudo apt install pkg-config
$ sudo apt install libtool
$ sudo apt install cmake
$ sudo apt install kernel-package
$ sudo apt install iproute2
$ sudo apt install dwarves
...

sudo add-apt-repository ppa:cappelikan/ppa
sudo apt install mainline
sudo mainline --list
sudo mainline --install <version string>

A simple example eBPF program source ("xdp-test.c")

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#if 0L /* when the libbpf headers are available */
#include <bpf/bpf_helpers.h>
#else /* no libbpf header */
# if !defined(SEC)
#  define SEC(name) \
        _Pragma("GCC diagnostic push")                                      \
        _Pragma("GCC diagnostic ignored \"-Wignored-attributes\"")          \
        __attribute__((section(name), used))                                \
        _Pragma("GCC diagnostic pop")
# endif
#endif

SEC("prog")
int xdp_pass_func(struct xdp_md *ctx)
{
        return XDP_PASS; /* hand every packet to the kernel network stack */
}

char _license[] SEC("license") = "GPL"; /* which helper functions and features are available can differ depending on whether the license is GPL or not */

Example of compiling and linking with Clang + LLVM (the libbpf library and headers can be obtained with "git clone https://github.com/xdp-project/libbpf.git")
```
$ clang -target bpf -I "<libbpf header location>" -emit-llvm -c -o xdp-test.ll xdp-test.c
$ llc -march bpf -filetype obj -o xdp-test.o xdp-test.ll
```
Loading and running the eBPF program on a given interface (xdp-tools can be obtained with "git clone https://github.com/xdp-project/xdp-tools.git")
```
$ sudo ip link set dev "<interface name>" xdp object xdp-test.o section "<section name>" verbose

# or, if xdp-tools has been downloaded and installed
$ sudo xdp-loader load --verbose --mode native --section "<section name>" "<interface name>" xdp-test.o
```
- Running in a container requires CAP_SYS_ADMIN or CAP_BPF (v5.8 and later) together with CAP_NET_ADMIN, volume mappings for "/sys/kernel/debug", "/sys/fs/cgroup", "/sys/fs/bpf" ("mount bpffs /sys/fs/bpf -t bpf") and similar paths, and granting permissions such as privileged(=true).
- To let a regular (non-root) user use this on an ordinary Linux system without root privileges, the sysctl kernel.unprivileged_bpf_disabled has to be changed.

Checking that the eBPF program is loaded

$ sudo ip link show dev "<interface name>"
<ifindex>: <interface name>: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:a8:c3:19 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 19 tag b8a375b5b20c0074 jited

or, if xdp-tools has been downloaded and installed

$ sudo xdp-loader status "<interface name>"
CURRENT XDP PROGRAM STATUS:

Interface        Prio  Program name      Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
<interface name>                                 native   19   b8a375b5b20c0074

Removing the eBPF program

$ sudo ip link set dev "<interface name>" xdp off

or, if xdp-tools has been downloaded and installed

$ sudo xdp-loader unload --all --verbose "<interface name>"

If the eBPF program prints debugging messages with the bpf_trace_printk(fmt, fmt_size, ...) helper function, the output can be read from "/sys/kernel/debug/tracing/trace_pipe".
```
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
```
Details of the ELF file produced for the eBPF program, and its disassembly, can be inspected with tools such as readelf and llvm-objdump.
```
$ readelf -a "<XDP program elf object>"
$ llvm-objdump -D "<XDP program elf object>"
```

1.3. BPF : cBPF (Classic BPF) / eBPF (Extended Berkeley Packet Filter)

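cBPF (Classic BPF) has existed since 1992 but is now considered legacy, and eBPF first appeared in Linux Kernel v3.18. Today cBPF is also transparently converted to eBPF internally.

cBPF is known as the packet filter language used by tcpdump and by some ping commands. In this document the term cBPF is used only for things specific to cBPF, and the term BPF is used everywhere else.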

1.3.1. BPF Architecture

BPF is more than an instruction set: it also comes with the following surrounding infrastructure.

- Maps, which act as efficient key / value stores
- Helper functions for interacting with and making use of kernel functionality
- Tail calls for calling other BPF programs
- Security hardening primitives
- A pseudo file system for pinning objects (maps, programs)
- Infrastructure for offloading BPF to a NIC or similar device

LLVM provides a BPF backend, so a frontend tool such as Clang can compile source written in C into a BPF object file, which can then be loaded into the kernel. This allows programmability without sacrificing the native performance of the Linux kernel.

The kernel subsystems that use BPF are also part of this BPF infrastructure. Two of them to which BPF programs can be attached are "XDP BPF" and "tc BPF". They have the following characteristics.

"XDP BPF"
- Runs at the earliest possible point when a packet is received, which gives it the potential for the best packet processing performance.
- However, because it runs before the network stack, it has to work with packets that the kernel network stack has not yet parsed or processed.
"tc BPF"
- Runs inside the network stack, so more information and processing are available and it has access to core kernel functionality.
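
To make the "raw packet" point above concrete, here is a minimal user-space sketch of the kind of decision an XDP program makes. It is not a loadable XDP program: the function name demo_xdp_decide, the DEMO_* constants and the "drop everything that is not IPv4" policy are assumptions made for illustration. The two pointers mimic the data and data_end fields of struct xdp_md, and the explicit bounds check mirrors what the in-kernel verifier requires before any packet byte is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as XDP_DROP (1) and XDP_PASS (2); the names are local to this sketch. */
enum demo_xdp_action { DEMO_XDP_DROP = 1, DEMO_XDP_PASS = 2 };

/* Hypothetical policy: pass IPv4 frames, drop everything else. */
static int demo_xdp_decide(const uint8_t *data, const uint8_t *data_end)
{
    /* An Ethernet header is 14 bytes: destination MAC, source MAC, EtherType. */
    if (data_end - data < 14)
        return DEMO_XDP_DROP; /* truncated frame: never read past data_end */

    /* The EtherType is stored in network byte order at offset 12. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);

    return (ethertype == 0x0800) ? DEMO_XDP_PASS : DEMO_XDP_DROP;
}
```

Nothing here has been parsed by the network stack yet, so even the EtherType has to be extracted by hand. A tc BPF program, by contrast, receives a context that already carries this kind of metadata.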

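1.3.2. BPF instruction set

BPF is a general-purpose RISC (Reduced Instruction Set Computer) instruction set, designed so that programs can be written in a subset of C and compiled into BPF instructions by a compiler backend (for example LLVM). The kernel can then map them to native opcodes through its in-kernel JIT (just-in-time) compiler for optimal execution performance.

Running BPF instructions in the kernel has the following advantages.

- The kernel becomes programmable without crossing the kernel / user space boundary, while state can still be shared through maps [1]. In other words, policies such as container policies can be implemented without moving packets to user space and back into the kernel.
- Thanks to the flexibility of a programmable data path, programs can be heavily optimized for performance by building in only what is actually needed.
- BPF programs can be updated atomically without restarting the kernel, services or containers, and without interrupting traffic. Program state can also be maintained across updates through BPF maps.
- BPF provides a stable ABI (Application Binary Interface) towards user space and does not require third-party kernel modules. BPF programs are guaranteed to keep running on newer kernel versions and are portable across architectures.
- BPF programs build on the existing kernel infrastructure (drivers, netdevices, tunnels, protocol stack, sockets, ...) and tooling (iproute2, ...) and on their stability. They must pass an in-kernel verifier that prevents them from crashing or hanging the kernel. BPF programs can serve as generic 'glue code' for kernel facilities and solve a variety of use cases.

BPF program execution inside the kernel is always event-driven. For example:

- If a BPF program is attached to a NIC, it runs when a packet is received.
- If a BPF program is attached to a kprobe, then when execution reaches the probed kernel address the kprobe callback is invoked, which in turn runs the BPF program for tracing.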

[PNG image (42.32 KB)]

BPF consists of eleven 64-bit registers (r0 ~ r10) with 32-bit subregisters, a program counter, a 512-byte BPF stack space, and key-value maps.

The operating mode is 64-bit by default; the 32-bit subregisters can only be accessed through special ALU (arithmetic logic unit) operations. When written, a 32-bit subregister is zero-extended to 64 bits.
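
The zero-extension rule can be illustrated with a small user-space model. This is only a sketch of the semantics described above, not kernel code; the function names demo_alu64_add and demo_alu32_add are made up for this example.

```c
#include <stdint.h>

/* 64-bit ALU add (BPF_ALU64 class): operates on the full registers. */
static uint64_t demo_alu64_add(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* 32-bit ALU add (BPF_ALU class): operates on the lower 32 bits only,
 * and the result is zero-extended when written back to the 64-bit register. */
static uint64_t demo_alu32_add(uint64_t dst, uint64_t src)
{
    uint32_t result = (uint32_t)dst + (uint32_t)src; /* wraps at 32 bits */
    return (uint64_t)result;                          /* upper 32 bits become 0 */
}
```

For example, adding 1 to a register holding 0xFFFFFFFFFFFFFFFF with the 32-bit variant leaves 0 in the whole 64-bit register, because the upper half is cleared rather than preserved.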

Register r10 is read-only and is the only register that holds the frame pointer address used to access the BPF stack space. The remaining registers r0 - r9 are general-purpose and read / write.

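A BPF program can call predefined helper functions provided by the core kernel (never by modules). The BPF calling convention is defined as follows.

Register r0 holds the return value of a helper function call.
- It is also used for the exit value of the BPF program (for XDP, an xdp_action). The meaning of the exit value is defined by the BPF program type. When the exit value is handed back to the kernel, it is passed as a 32-bit value.
  - XDP_ABORTED(0) : Drops the packet and signals a program error (the xdp:xdp_exception tracepoint fires).
  - XDP_DROP(1) : Drops the packet.
  - XDP_PASS(2) : Lets the kernel network stack process the packet.
  - XDP_TX(3) : Transmits the packet back out of the NIC it arrived on.
  - XDP_REDIRECT(4) : Redirects the packet (to the user space side of an AF_XDP socket or to another NIC).
  - An undefined exit value may be treated as XDP_DROP or as XDP_ABORTED.
Registers r1 ~ r5 hold the arguments passed from the BPF program to a helper function.
- They are also scratch registers [2]. That is, to reuse these arguments across multiple helper function calls, the BPF program must either spill them to the BPF stack or move them into the callee-saved registers r6 ~ r9.
- When a BPF program starts, register r1 (the first argument) initially holds the program's context. (This is similar to the argc/argv pair of the main function in an ordinary C program.)
Registers r6 ~ r9 are callee-saved registers that are preserved across helper function calls.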

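The BPF calling convention is generic enough to map directly onto x86_64, arm64 and other ABIs (Application Binary Interfaces), so all BPF registers map one to one onto hardware CPU registers. The JIT therefore only has to emit a call instruction and needs no additional moves to place function arguments. The calling convention was modeled to cover common call situations without a performance penalty. Calls with six or more arguments are currently not supported. The kernel helper functions dedicated to BPF (the BPF_CALL_0() ~ BPF_CALL_5() functions) are designed specifically with this convention in mind.

The general operation of BPF is 64-bit in order to follow the natural pointer arithmetic of 64-bit architectures, to pass 64-bit values into helper functions, and to allow 64-bit atomic operations.

The maximum number of instructions per BPF program is limited to 4096, which means programs are meant by design to terminate quickly. (In kernels newer than v5.1 this limit was raised to 1 million instructions; see the BPF_MAXINSNS and BPF_COMPLEXITY_LIMIT_INSNS defines in the kernel.) Although the instruction set contains forward as well as backward jumps, the in-kernel BPF verifier forbids infinite loops, so termination is guaranteed. Because BPF programs run inside the kernel, this is a safeguard that verifies they cannot affect the kernel's stability. A BPF program can also chain into another program (BPF tail call: see the bpf_tail_call function in the kernel), in which case the nesting is limited to 33 calls (see the MAX_TAIL_CALL_CNT define in the kernel).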

The BPF instruction format is modeled as two-operand instructions, which helps map them to native code during the JIT (just-in-time) phase. Every BPF instruction has a fixed-size 64-bit encoding. Currently 87 instructions are implemented, and further instructions can be added when needed. On a big-endian machine, a single 64-bit BPF instruction is laid out from MSB (Most Significant Bit) to LSB (Least Significant Bit) as opcode 8 bits, dst_reg 4 bits, src_reg 4 bits, off 16 bits, imm 32 bits. Depending on the instruction, some fields are unused, in which case they are set to 0. For the encoding, see eBPF Instruction Set and the Unofficial eBPF spec.

BPF instruction encoding
32 ~ 63 (32 bits) (MSB)	16 ~ 31 (16 bits)	12 ~ 15 (4 bits)	8 ~ 11 (4 bits)	0 ~ 7 (8 bits) (LSB)
immediate (imm: signed immediate constant)	offset (off: signed offset)	source register (src_reg)	destination register (dst_reg)	opcode

/* Linux kernel header : "include/uapi/linux/bpf.h" */
struct bpf_insn {
        __u8    code;           /* opcode */
        __u8    dst_reg:4;      /* dest register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
};
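
As a hedged illustration of the field layout above, the following sketch packs the five fields into the 8 bytes of one instruction as they appear in memory on a little-endian host (opcode first, then one byte holding dst_reg in the low nibble and src_reg in the high nibble, then off and imm in little-endian order). The helper name demo_encode_le is made up for this example, and the little-endian byte order is an assumption of the sketch; on a big-endian host the nibbles and multi-byte fields are ordered differently.

```c
#include <stdint.h>

/* Packs one BPF instruction into 8 bytes, little-endian host view. */
static void demo_encode_le(uint8_t out[8], uint8_t opcode,
                           unsigned dst_reg, unsigned src_reg,
                           int16_t off, int32_t imm)
{
    uint16_t uoff = (uint16_t)off; /* keep the two's complement bit pattern */
    uint32_t uimm = (uint32_t)imm;

    out[0] = opcode;
    out[1] = (uint8_t)(((src_reg & 0x0f) << 4) | (dst_reg & 0x0f));
    out[2] = (uint8_t)(uoff & 0xff);
    out[3] = (uint8_t)(uoff >> 8);
    out[4] = (uint8_t)(uimm & 0xff);
    out[5] = (uint8_t)((uimm >> 8) & 0xff);
    out[6] = (uint8_t)((uimm >> 16) & 0xff);
    out[7] = (uint8_t)(uimm >> 24);
}
```

For instance, "r0 = 2" (BPF_ALU64 | BPF_MOV | BPF_K, opcode 0xb7) encodes to the bytes b7 00 00 00 02 00 00 00, and "r6 = r1" (BPF_ALU64 | BPF_MOV | BPF_X, opcode 0xbf) starts with bf 16.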

Among the opcodes, the 'Arithmetic and jump instructions' (BPF_ALU, BPF_ALU64, BPF_JMP, BPF_JMP32) are defined as follows.

opcode: Arithmetic and jump instructions
4 bits (MSB)	1 bit	3 bits (LSB)
operation code	source	instruction class

For 'Arithmetic instructions', the 'operation code' part of the opcode is defined as follows.

Arithmetic instructions: operation code
code	Value	Description	Note
BPF_ADD	0x00	dst += src
BPF_SUB	0x10	dst -= src
BPF_MUL	0x20	dst *= src
BPF_DIV	0x30	dst /= src
BPF_OR	0x40	dst |= src
BPF_AND	0x50	dst &= src
BPF_LSH	0x60	dst <<= src
BPF_RSH	0x70	dst >>= src
BPF_NEG	0x80	dst = -dst
BPF_MOD	0x90	dst %= src
BPF_XOR	0xa0	dst ^= src
BPF_MOV	0xb0	dst = src	mov reg to reg
BPF_ARSH	0xc0	sign extending shift right	sign extending arithmetic shift right
BPF_END	0xd0	endianness conversion (flags for endianness conversion)	BPF_TO_LE/BPF_FROM_LE (0x00), BPF_TO_BE/BPF_FROM_BE (0x08)

For 'Jump instructions', the 'operation code' part of the opcode is defined as follows.

Jump instructions: operation code
code	Value	Description	Note
BPF_JA	0x00	PC += off	BPF_JMP only
BPF_JEQ	0x10	PC += off if dst == src
BPF_JGT	0x20	PC += off if dst > src	unsigned
BPF_JGE	0x30	PC += off if dst >= src	unsigned
BPF_JSET	0x40	PC += off if dst & src
BPF_JNE	0x50	PC += off if dst != src
BPF_JSGT	0x60	PC += off if dst > src	signed
BPF_JSGE	0x70	PC += off if dst >= src	signed
BPF_CALL	0x80	function call
BPF_EXIT	0x90	function / program return (return r0)	BPF_JMP only
BPF_JLT	0xa0	PC += off if dst < src	unsigned
BPF_JLE	0xb0	PC += off if dst <= src	unsigned
BPF_JSLT	0xc0	PC += off if dst < src	signed
BPF_JSLE	0xd0	PC += off if dst <= src	signed
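
The "unsigned" and "signed" notes in the table matter whenever the top bit of a register is set. The following sketch models the comparison behind BPF_JGT and BPF_JSGT on 64-bit register values; the function names are made up for this example, and the conversion to int64_t assumes the usual two's complement representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* BPF_JGT: the jump is taken if dst > src, compared as unsigned values. */
static bool demo_jgt(uint64_t dst, uint64_t src)
{
    return dst > src;
}

/* BPF_JSGT: the jump is taken if dst > src, compared as signed values. */
static bool demo_jsgt(uint64_t dst, uint64_t src)
{
    return (int64_t)dst > (int64_t)src;
}
```

With dst = 0xFFFFFFFFFFFFFFFF and src = 1, demo_jgt() is true (the largest unsigned value) while demo_jsgt() is false (the same bits mean -1 when read as signed).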

For 'Arithmetic and jump instructions', the 'source' operand part of the opcode is defined as follows.

The 4th bit encodes the source operand
source	Value	Description
BPF_K	0x00	use 32-bit immediate as source operand
BPF_X	0x08	use 'src_reg' register as source operand

Among the opcodes, the 'Load and store instructions' (BPF_LD, BPF_LDX, BPF_ST, BPF_STX) are defined as follows.

opcode: Load and store instructions
3 bits (MSB)	2 bits	3 bits (LSB)
mode	size	instruction class

For 'Load and store instructions', the 'size' modifier part of the opcode is defined as follows.

The size modifier
size modifier	Value	Description
BPF_W	0x00	word (4 bytes)
BPF_H	0x08	half word (2 bytes)
BPF_B	0x10	byte (1 byte)
BPF_DW	0x18	double word (8 bytes)
- 16-byte instructions : In eBPF, some instructions occupy 16 bytes (64 bits x 2) instead of the usual 8 bytes (64 bits). "BPF_LD | BPF_DW | BPF_IMM" consists of two consecutive 8-byte (64-bit) blocks and is interpreted as a single instruction that loads a 64-bit immediate value into dst_reg.
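
The 16-byte "BPF_LD | BPF_DW | BPF_IMM" form can be sketched as two consecutive instruction slots: the first carries the opcode (0x18), the destination register and the lower 32 bits of the value, and the second carries only the upper 32 bits in its imm field with all other fields zero. The struct and function names below are made up for this illustration, and the regs field is a simplified stand-in for the dst_reg/src_reg nibbles.

```c
#include <stdint.h>

/* Simplified view of one 8-byte instruction slot (names are local to this sketch). */
struct demo_insn_slot {
    uint8_t  opcode;
    uint8_t  regs;   /* dst_reg in the low nibble, src_reg in the high nibble */
    int16_t  off;
    uint32_t imm;
};

/* Splits a 64-bit immediate across the two slots of a 16-byte load. */
static void demo_ld_imm64(struct demo_insn_slot out[2], unsigned dst_reg, uint64_t value)
{
    out[0].opcode = 0x18;                     /* BPF_LD (0x00) | BPF_DW (0x18) | BPF_IMM (0x00) */
    out[0].regs   = (uint8_t)(dst_reg & 0x0f);
    out[0].off    = 0;
    out[0].imm    = (uint32_t)value;          /* lower 32 bits */

    out[1].opcode = 0;                        /* second slot: everything but imm is zero */
    out[1].regs   = 0;
    out[1].off    = 0;
    out[1].imm    = (uint32_t)(value >> 32);  /* upper 32 bits */
}

/* Reassembles the 64-bit value the way an interpreter would read it. */
static uint64_t demo_ld_imm64_value(const struct demo_insn_slot in[2])
{
    return ((uint64_t)in[1].imm << 32) | in[0].imm;
}
```

This is also why an instruction count and a byte offset divided by 8 can differ by one slot for every 64-bit immediate load in a program.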

For 'Load and store instructions', the 'mode' modifier part of the opcode is defined as follows.

The mode modifier
mode modifier	Value	Description
BPF_IMM	0x00	used for 64-bit mov
BPF_ABS	0x20	legacy BPF packet access
BPF_IND	0x40	legacy BPF packet access
BPF_MEM	0x60	all normal load and store operations
(reserved)	0x80	reserved
(reserved)	0xa0	reserved
BPF_ATOMIC	0xc0	atomic operations (atomic memory ops - op type in immediate)

The three least significant bits of the opcode encode the 'Instruction class', defined as follows.

Instruction classes
class	Value	Description	Note
BPF_LD	0x00	non-standard load operations	Load instructions
BPF_LDX	0x01	load into register operations	Load instructions
BPF_ST	0x02	store from immediate operations	Store instructions
BPF_STX	0x03	store from register operations	Store instructions
BPF_ALU	0x04	32-bit arithmetic operations	Arithmetic instructions
BPF_JMP	0x05	64-bit jump operations	Jump instructions
BPF_JMP32	0x06	32-bit jump operations (Jump mode in word width)	Jump instructions
BPF_ALU64	0x07	64-bit arithmetic operations (ALU mode in double word width)	Arithmetic instructions
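
Putting the tables above together, an opcode byte can be taken apart with a few bit masks. The following is a minimal sketch; the function names are made up for this example, while the masks follow directly from the bit widths given in the tables (3-bit class, then either a 1-bit source plus 4-bit operation code, or a 2-bit size plus 3-bit mode).

```c
#include <stdint.h>

/* All opcodes: the lowest 3 bits are the instruction class. */
static uint8_t demo_opcode_class(uint8_t opcode)     { return opcode & 0x07; }

/* Arithmetic and jump instructions: 1-bit source, 4-bit operation code. */
static uint8_t demo_opcode_source(uint8_t opcode)    { return opcode & 0x08; }
static uint8_t demo_opcode_operation(uint8_t opcode) { return opcode & 0xf0; }

/* Load and store instructions: 2-bit size, 3-bit mode. */
static uint8_t demo_opcode_size(uint8_t opcode)      { return opcode & 0x18; }
static uint8_t demo_opcode_mode(uint8_t opcode)      { return opcode & 0xe0; }
```

For example, 0xb7 decomposes into class 0x07 (BPF_ALU64), source 0x00 (BPF_K) and operation 0xb0 (BPF_MOV), and 0x61 decomposes into class 0x01 (BPF_LDX), size 0x00 (BPF_W) and mode 0x60 (BPF_MEM).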

For atomic operations [3], eBPF encodes the operation in the immediate field rather than in the opcode. The immediate supports BPF_ADD, BPF_AND, BPF_OR and BPF_XOR as the basic operations. If BPF_FETCH (0x01) is set as well, src_reg is overwritten with the value that was in memory (at dst_reg + off) before it was modified. The immediate can also be BPF_XCHG (0xe0 | BPF_FETCH) or BPF_CMPXCHG (0xf0 | BPF_FETCH). 1-byte and 2-byte atomic operations are currently not supported.

=> See the implementation of the ___bpf_prog_run() function in "kernel/bpf/core.c".

#define BPF_ATOMIC_OP(SIZE, OP, DST, SRC, OFF)                  \
        ((struct bpf_insn) {                                    \
                .code  = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC, \
                .dst_reg = DST,                                 \
                .src_reg = SRC,                                 \
                .off   = OFF,                                   \
                .imm   = OP })

.imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
    BPF_ATOMIC_OP(sizeof(u32), BPF_ADD, <dst_reg>, <src_reg>, <off16>)
.imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
    BPF_ATOMIC_OP(sizeof(u64), BPF_ADD, <dst_reg>, <src_reg>, <off16>)

Atomic operations: immediate
immediate	Value	Description	Note
BPF_ADD	0x00	*(uint *)(dst_reg + off16) += src_reg	atomic_add(src_reg, dst_reg + off16)
BPF_AND	0x50	*(uint *)(dst_reg + off16) &= src_reg	atomic_and(src_reg, dst_reg + off16)
BPF_OR	0x40	*(uint *)(dst_reg + off16) |= src_reg	atomic_or(src_reg, dst_reg + off16)
BPF_XOR	0xa0	*(uint *)(dst_reg + off16) ^= src_reg	atomic_xor(src_reg, dst_reg + off16)
BPF_ADD | BPF_FETCH	0x01 (0x00 | 0x01)	tmp = *(uint *)(dst_reg + off16), *(uint *)(dst_reg + off16) += src_reg, src_reg = tmp	src_reg = atomic_fetch_add(src_reg, dst_reg + off16)
BPF_AND | BPF_FETCH	0x51 (0x50 | 0x01)	tmp = *(uint *)(dst_reg + off16), *(uint *)(dst_reg + off16) &= src_reg, src_reg = tmp	src_reg = atomic_fetch_and(src_reg, dst_reg + off16)
BPF_OR | BPF_FETCH	0x41 (0x40 | 0x01)	tmp = *(uint *)(dst_reg + off16), *(uint *)(dst_reg + off16) |= src_reg, src_reg = tmp	src_reg = atomic_fetch_or(src_reg, dst_reg + off16)
BPF_XOR | BPF_FETCH	0xa1 (0xa0 | 0x01)	tmp = *(uint *)(dst_reg + off16), *(uint *)(dst_reg + off16) ^= src_reg, src_reg = tmp	src_reg = atomic_fetch_xor(src_reg, dst_reg + off16)
BPF_XCHG (0xe0 | BPF_FETCH)	0xe1 (0xe0 | 0x01)	tmp = *(uint *)(dst_reg + off16), *(uint *)(dst_reg + off16) = src_reg, src_reg = tmp	src_reg = atomic_xchg(dst_reg + off16, src_reg)
BPF_CMPXCHG (0xf0 | BPF_FETCH)	0xf1 (0xf0 | 0x01)	(*(uint *)(dst_reg + off16) == r0) ? *(uint *)(dst_reg + off16) = src_reg : r0 = *(uint *)(dst_reg + off16)	r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg)
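
The semantics in the table can be mirrored in user space with C11 atomics. The following is only a model of what the register values look like after each operation, not how the kernel implements it; the demo_* function names are made up, and each return value stands for what ends up in src_reg (or in r0 for BPF_CMPXCHG).

```c
#include <stdatomic.h>
#include <stdint.h>

/* BPF_ADD | BPF_FETCH: memory += src_reg, and src_reg receives the old memory value. */
static uint64_t demo_atomic_fetch_add(_Atomic uint64_t *mem, uint64_t src_reg)
{
    return atomic_fetch_add(mem, src_reg);
}

/* BPF_XCHG: memory = src_reg, and src_reg receives the old memory value. */
static uint64_t demo_atomic_xchg(_Atomic uint64_t *mem, uint64_t src_reg)
{
    return atomic_exchange(mem, src_reg);
}

/* BPF_CMPXCHG: if memory == r0 then memory = src_reg.
 * In both cases r0 ends up holding the value memory had before the operation. */
static uint64_t demo_atomic_cmpxchg(_Atomic uint64_t *mem, uint64_t r0, uint64_t src_reg)
{
    uint64_t expected = r0;
    /* On failure, C11 writes the current memory value into 'expected'. */
    atomic_compare_exchange_strong(mem, &expected, src_reg);
    return expected;
}
```

A caller can tell whether the compare-and-exchange succeeded by checking whether the returned value equals the r0 it passed in.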

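There are two non-generic instructions for accessing packet data: "BPF_ABS | <size> | BPF_LD" and "BPF_IND | <size> | BPF_LD". They were carried over so that cBPF socket filters running in the eBPF interpreter keep their strong performance. These instructions can only be used when the interpreter context is a struct sk_buff, and they have seven implicit operands. r6 must hold the sk_buff pointer, r0 is an implicit output that contains the data fetched from the packet, and r1 ~ r5 are scratch registers that must not be used to keep data across a "BPF_ABS | BPF_LD" or "BPF_IND | BPF_LD" instruction. These instructions also have an implicit program exit condition: if the program tries to access data beyond the packet boundary, the interpreter aborts execution. The src_reg and imm32 fields are the explicit inputs to these instructions.

As an example, "BPF_IND | BPF_W | BPF_LD" means the following.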

/* r0 = Data in packet, r1 - r5 are clobbered, r6 = struct sk_buff pointer, src_reg + imm32 = struct sk_buff data offset */
r0 = ntohl(*(u32 *) (((struct sk_buff *) r6)->data + src_reg + imm32));
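
The implicit bounds check can be made explicit in a user-space model. The sketch below is a simplification under stated assumptions: the packet is a plain byte array instead of a struct sk_buff, negative offsets are simply rejected, and the name demo_ld_ind_w is made up. A return value of -1 stands for "the interpreter aborts the program".

```c
#include <stddef.h>
#include <stdint.h>

/* Models "BPF_IND | BPF_W | BPF_LD": loads a 32-bit big-endian word from
 * data[src_reg + imm32] into *r0, or reports an abort when out of bounds. */
static int demo_ld_ind_w(const uint8_t *data, size_t len,
                         uint32_t src_reg, int32_t imm32, uint32_t *r0)
{
    int64_t offset = (int64_t)src_reg + imm32;

    /* Reading 4 bytes must stay inside the packet. */
    if (offset < 0 || (uint64_t)offset + 4 > len)
        return -1;

    const uint8_t *p = data + offset;

    /* Same result as ntohl() on the 4 bytes at p, independent of host endianness. */
    *r0 = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return 0;
}
```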

1.4. AF_XDP socket (XSK)

[PNG image (335.13 KB)]

AF_XDP is an address family optimized for high-performance packet processing.

AF_XDP moves frames to user space without going through the kernel network stack. It does not bypass the kernel completely, but creates an in-kernel fast path that is as short as possible. It offers advantages such as zero copy between kernel space and user space and offloading of XDP bytecode to the NIC, and it can run in interrupt mode as well as polling mode (a DPDK poll mode driver always polls; see DPDK PMD for AF_XDP).

To redirect received frames to another XDP-enabled netdev, an XDP program uses the XDP_REDIRECT action through the bpf_redirect_map() function. With an AF_XDP socket (XSK), an XDP program can redirect received frames into a memory buffer of a user space application.

Example XDP program (see "samples/bpf/xdpsock_kern.c" in the Linux kernel source)

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_SOCKS 4

/* This XDP program is only needed for the XDP_SHARED_UMEM mode.
 * If you do not use this mode, libbpf can supply an XDP program for you.
 */

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, MAX_SOCKS);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");

static unsigned int rr;

SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
        rr = (rr + 1) & (MAX_SOCKS - 1);

        return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}

An AF_XDP socket (XSK) is created with the ordinary socket() function. It consists of an Rx ring and a Tx ring: packets are received on the Rx ring and sent through the Tx ring. These rings are registered and sized with setsockopt() using XDP_RX_RING and XDP_TX_RING. The Rx / Tx descriptor rings point into a memory buffer called the UMEM. Rx and Tx can share the same UMEM, so packets do not have to be copied between Rx and Tx.

[PNG image (205.04 KB)]

On the receive path, the UMEM and the ring buffers are used as follows.

The user side fills the fill queue (producer, umem fq) with free UMEM chunks.
- Check how many free UMEM chunks can be put into the fill queue.
- Reserve that many slots in the fill queue.
- Write the chunks allocated from the free UMEM chunks into the fill queue (fill addr).
- Submit the filled slots to the fill queue.
When data has been received, the kernel side stores it in the chunks supplied through the fill queue and moves them to the rx queue.
The user side can then process the received packets from the consumer queue (rx queue).

On the transmit path, the UMEM and the ring buffers are used as follows.

The user side places packets, for example those received on the receive path, into the tx queue.
The user side signals the kernel side that the queue has been filled (by calling a send function).
The kernel side starts transmitting from the tx queue and, once transmission has completed, moves the chunks to the completion queue (consumer queue, umem cq).
The user side reclaims the transmitted chunks from the completion queue (umem cq) as free UMEM chunks.
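
Both flows above keep taking chunks from, and returning chunks to, a pool of free UMEM chunks that the application has to manage itself. A minimal sketch of such a pool is shown below, assuming a UMEM split into 8 chunks of 2048 bytes each (both numbers, the struct and all demo_* names are assumptions made for illustration). Chunks are identified by their byte offset into the UMEM, and the free list is a simple stack.

```c
#include <stdint.h>

#define DEMO_NUM_CHUNKS   8u          /* assumed number of chunks in the UMEM */
#define DEMO_CHUNK_SIZE   2048u       /* assumed chunk size in bytes */
#define DEMO_INVALID_ADDR UINT64_MAX  /* returned when no chunk is free */

struct demo_umem_pool {
    uint64_t free_addr[DEMO_NUM_CHUNKS]; /* stack of free chunk offsets */
    uint32_t n_free;                     /* number of valid entries in free_addr */
};

/* Initially every chunk of the UMEM is free. */
static void demo_pool_init(struct demo_umem_pool *pool)
{
    for (uint32_t i = 0; i < DEMO_NUM_CHUNKS; i++)
        pool->free_addr[i] = (uint64_t)i * DEMO_CHUNK_SIZE;
    pool->n_free = DEMO_NUM_CHUNKS;
}

/* Number of chunks that can currently be handed to the fill queue. */
static uint32_t demo_get_umem_free_chunks(const struct demo_umem_pool *pool)
{
    return pool->n_free;
}

/* Takes one free chunk and returns its offset into the UMEM. */
static uint64_t demo_alloc_umem(struct demo_umem_pool *pool)
{
    if (pool->n_free == 0)
        return DEMO_INVALID_ADDR;
    return pool->free_addr[--pool->n_free];
}

/* Returns a chunk, e.g. one reclaimed from the completion queue, to the pool. */
static int demo_free_chunk(struct demo_umem_pool *pool, uint64_t addr)
{
    if (pool->n_free == DEMO_NUM_CHUNKS)
        return -1; /* more frees than allocations: caller bug */
    pool->free_addr[pool->n_free++] = addr;
    return 0;
}
```

This sketch is single-threaded; an application that fills and reclaims chunks from different threads would need its own locking or per-thread pools.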

A ring buffer has a producer that fills it and a consumer that processes it.

The "struct xsk_ring_prod" data structure is for the producer side; it is used to fill in the offsets of unused UMEM chunks.

__u32 n_umem_free_chunk;
__u32 n_chunk;
__u32 n_reserved;
__u32 pos;
__u32 i;

n_umem_free_chunk = get_umem_free_chunks(); /* get_umem_free_chunks is implemented by the user: as part of the code that allocates and frees UMEM chunks, it returns the number of chunks that are currently free and usable. */
n_chunk = n_umem_free_chunk; /* Initially n_chunk is set to the number of free chunks, but afterwards a suitable value has to be chosen. */
for(;;) {
    n_reserved = xsk_ring_prod__reserve(&xsk_ring_prod, n_chunk, &pos);
    if(n_reserved == n_chunk) break;    /* the producer slots to fill have been reserved */

    /* fewer than the requested n_chunk producer slots could be reserved */
    if(xsk_ring_prod__needs_wakeup(&xsk_ring_prod)) {
        /* wait until producer slots become available, or delay for a while */
    }
}

for(i = 0u;i < n_chunk;i++) {
    /* CASE : fill queue on the receive side */
    *xsk_ring_prod__fill_addr(&xsk_ring_prod /* umem_fq */, pos + i) = alloc_umem(); /* alloc_umem is implemented by the user: it takes one chunk from the free UMEM chunks and returns its offset. */

    /* CASE : tx queue on the transmit side */
    xsk_ring_prod__tx_desc(&xsk_ring_prod /* txq */, pos + i)->addr = <address of the chunk holding the data to send>;
    xsk_ring_prod__tx_desc(&xsk_ring_prod /* txq */, pos + i)->len = <size of the data to send>;
}
xsk_ring_prod__submit(&xsk_ring_prod, n_chunk); /* The UMEM chunks handed to the producer ring can now be processed. */

/* CASE : tx queue on the transmit side */
if(xsk_ring_prod__needs_wakeup(&xsk_ring_prod /* txq */)) {
    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);   /* Filling the txq does not start transmission by itself; a send call used as a signal is needed to trigger Tx. */
}

The "struct xsk_ring_cons" data structure is for the consumer side; it is used to process the UMEM chunks that the producer filled in (received data or completed transmissions).

__u32 n_cons_size;
__u32 pos;
__u32 n_chunk;
__u32 i;

n_cons_size = ...;  /* n_cons_size is the maximum number of chunks that xsk_ring_cons can hold. */
n_chunk = xsk_ring_cons__peek(&xsk_ring_cons, n_cons_size, &pos);

for(i = 0u;i < n_chunk;i++) {
    /* CASE: completion queue */
    {
        __u64 addr = *xsk_ring_cons__comp_addr(&xsk_ring_cons, pos + i); /* get the chunk offset */
        free_chunk(addr); /* free_chunk is implemented by the user: it returns the allocated UMEM chunk addr to the free UMEM chunks. */
    }

    /* CASE: rx queue */
    {
        __u64 addr = xsk_ring_cons__rx_desc(&xsk_ring_cons, pos + i)->addr;
        __u32 len = xsk_ring_cons__rx_desc(&xsk_ring_cons, pos + i)->len;
    }
}
xsk_ring_cons__release(&xsk_ring_cons, n_chunk); /* mark n_chunk entries as consumed */
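
The reserve / submit and peek / release steps above follow the usual single-producer, single-consumer ring protocol. The following is a simplified, single-threaded model of that protocol, not the libbpf implementation: the struct, the 4-slot ring size and all demo_* names are assumptions, and a real ring shared with the kernel additionally needs memory barriers. The producer and consumer counters only ever grow, and a slot is found by masking the counter with the ring size.

```c
#include <stdint.h>

#define DEMO_RING_SIZE 4u /* must be a power of two */

struct demo_ring {
    uint32_t producer;             /* advanced by submit */
    uint32_t consumer;             /* advanced by release */
    uint64_t desc[DEMO_RING_SIZE]; /* e.g. UMEM chunk offsets */
};

/* Producer: reserve n slots, all or nothing; *idx is the first reserved position. */
static uint32_t demo_reserve(const struct demo_ring *r, uint32_t n, uint32_t *idx)
{
    uint32_t free_slots = DEMO_RING_SIZE - (r->producer - r->consumer);
    if (n > free_slots)
        return 0;
    *idx = r->producer;
    return n;
}

/* Producer: publish n filled slots to the consumer. */
static void demo_submit(struct demo_ring *r, uint32_t n)
{
    r->producer += n;
}

/* Consumer: look at up to max published entries; *idx is the first one. */
static uint32_t demo_peek(const struct demo_ring *r, uint32_t max, uint32_t *idx)
{
    uint32_t available = r->producer - r->consumer;
    *idx = r->consumer;
    return (available < max) ? available : max;
}

/* Consumer: hand n processed slots back to the producer. */
static void demo_release(struct demo_ring *r, uint32_t n)
{
    r->consumer += n;
}

/* Maps a position returned by reserve or peek to its slot. */
static uint64_t *demo_slot(struct demo_ring *r, uint32_t idx)
{
    return &r->desc[idx & (DEMO_RING_SIZE - 1u)];
}
```

In AF_XDP the same pattern appears four times: the application is the producer for the fill and tx rings, and the consumer for the rx and completion rings, with the kernel on the other side of each.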

1.5. Summary of the XDP and xfrm (IPsec) flow in Linux Kernel v5.x

The figure below shows the flow of a network packet, drawn for Linux Kernel v5.x. (If it appears too small, try opening the image in a new window or a new tab.)

[PNG image (5.39 MB)]

1.6. References

Glossary

XDP : eXpress Data Path

BPF : Berkeley Packet Filter

eBPF : Extended Berkeley Packet Filter
- When a program written in eBPF is passed to the kernel through the "bpf()" system call, it runs inside the kernel in a sandboxed [4] interpreter (Interpreter : JIT virtual machine).

cBPF : cBPF (Classic BPF) has existed since 1992 but is now considered legacy, and eBPF first appeared in Linux Kernel v3.18. Today cBPF is also transparently converted to eBPF internally.

Socket filter example

This example source is application-side monitoring code that uses cBPF (Classic Berkeley Packet Filter) code to capture every incoming TCP SYN packet on my Linux box.

/*
  Copyright (C) HWPORT.COM
  All rights reserved.
  Code by JaeHyuk Cho <mailto:minzkn@minzkn.com>
*/

#if !defined(_ISOC99_SOURCE)
# define _ISOC99_SOURCE (1L)
#endif

#if !defined(_GNU_SOURCE)
# define _GNU_SOURCE (1L)
#endif

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <arpa/inet.h>

#include <linux/filter.h>
#include <linux/if_ether.h>
#include <linux/if.h>

int main(int s_argc, char **s_argv);

static void * (__mz_dump_ex__)(size_t s_s, size_t s_d, void * s_da, size_t s_si)
{ size_t s_o, s_w, s_lo; unsigned char s_b[ 16 + 1 ];
 if(s_da == NULL)return(NULL);
 s_b[sizeof(s_b) - 1] = (unsigned char)'\0';
 for(s_o = (size_t)0;s_o < s_si;s_o += (size_t)16){
  s_w = ((s_si - s_o) <= ((size_t)16)) ? (s_si - s_o) : ((size_t)16);
  (void)fprintf(stdout, "%08lX", (unsigned long)(s_d + s_o));
  for(s_lo = (size_t)0;s_lo < s_w;s_lo++){
   if(s_lo == ((size_t)(16 >> 1)))(void)fputs(" | ", stdout);
   else (void)fputs(" ", stdout);
   s_b[s_lo] = *(((unsigned char *)s_da) + (s_s + s_o + s_lo));
   (void)fprintf(stdout, "%02X", (int)s_b[s_lo]);
   if((s_b[s_lo] & ((unsigned char)(1 << 7))) || (s_b[s_lo] < ((unsigned char)' ')) ||
      (s_b[s_lo] == ((unsigned char)0x7f)))s_b[s_lo] = (unsigned char)'.';}
  while(s_lo < ((size_t)16)){
   if(s_lo == ((size_t)(16 >> 1)))(void)fputs("     ", stdout);
   else (void)fputs("   ", stdout);
   s_b[s_lo] = (unsigned char)' '; s_lo++;}
  (void)fprintf(stdout, " [%s]\n", (char *)(&s_b[0]));}
 return(s_da);
}
static __inline void * (mz_dump_ex)(size_t s_seek_offset, size_t s_display_offset, void * s_data, size_t s_size)
{ return(__mz_dump_ex__(s_seek_offset, s_display_offset, s_data, s_size)); }
#define mz_dump(m_data,m_size) mz_dump_ex((size_t)0,(size_t)0,(void *)(m_data),(size_t)(m_size))

int main(int s_argc, char **s_argv)
{
    int s_socket;
    unsigned char s_buffer[ 64 << 10 ];
    ssize_t s_recv_bytes;
# if 1L
    struct sock_filter s_bpf_code[] = { /* tcpdump -dd 'tcp[tcpflags] & (tcp-syn) != 0' */
        { 0x28, 0, 0, 0x0000000c },
        { 0x15, 0, 8, 0x00000800 },
        { 0x30, 0, 0, 0x00000017 },
        { 0x15, 0, 6, 0x00000006 },
        { 0x28, 0, 0, 0x00000014 },
        { 0x45, 4, 0, 0x00001fff },
        { 0xb1, 0, 0, 0x0000000e },
        { 0x50, 0, 0, 0x0000001b },
        { 0x45, 0, 1, 0x00000002 },
        { 0x6, 0, 0, 0x0000ffff },
        { 0x6, 0, 0, 0x00000000 },
    };
# else
    struct sock_filter s_bpf_code[] = { /* tcpdump -dd tcp */
        { 0x28, 0, 0, 0x0000000c },
        { 0x15, 0, 2, 0x000086dd },
        { 0x30, 0, 0, 0x00000014 },
        { 0x15, 3, 4, 0x00000006 },
        { 0x15, 0, 3, 0x00000800 },
        { 0x30, 0, 0, 0x00000017 },
        { 0x15, 0, 1, 0x00000006 },
        { 0x6, 0, 0, 0x00000800 },
        { 0x6, 0, 0, 0x00000000 }
    };
# endif
    struct sock_fprog s_filter;
    int s_check;

    (void)s_argc;
    (void)s_argv;

    s_socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP));
    if(s_socket == (-1)) {
        perror("socket");
        return(EXIT_FAILURE);
    }

    s_filter.len = sizeof(s_bpf_code) / sizeof(struct sock_filter);
    s_filter.filter = s_bpf_code;
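    s_check = setsockopt(s_socket, SOL_SOCKET, SO_ATTACH_FILTER, (const void *)(&s_filter), (socklen_t)sizeof(s_filter));
    if(s_check == (-1)) { /* without the filter every IPv4 packet would be dumped, so stop here */
        perror("setsockopt");
        (void)close(s_socket);
        return(EXIT_FAILURE);
    }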

    for(;;) {
        s_recv_bytes = recv(s_socket, (void *)(&s_buffer[0]), sizeof(s_buffer), 0);
        if(s_recv_bytes == ((ssize_t)(-1))) {
            perror("recv");
            break;
        }

        (void)fprintf(stdout, "recv %ld bytes\n", (long)s_recv_bytes);
        mz_dump(s_buffer, s_recv_bytes);
    }

    (void)close(s_socket);

    return(EXIT_SUCCESS);
}

/* vim: set expandtab: */
/* End of source */

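A single cBPF instruction holds four elements: the first element (16 bits) is the opcode, the second (8 bits) is the jump offset taken when the condition is true, the third (8 bits) is the jump offset taken when the condition is false, and the last (32 bits) carries a value whose meaning depends on the opcode. The opcodes and the sock_filter data structure are easiest to understand by looking at the "linux/filter.h" header.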
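
To show how these four elements work together, here is a hedged sketch of a tiny user-space interpreter that runs the TCP SYN filter from the example source above against a raw frame. It supports only the seven opcodes that this particular filter uses, and every demo_* name is made up for illustration. The kernel's real interpreter and JIT are far more complete. Jump offsets are relative to the instruction after the jump, and the return value is the number of bytes to accept (0 means drop).

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as described above: 16-bit opcode, 8-bit jt, 8-bit jf, 32-bit k. */
struct demo_sock_filter { uint16_t code; uint8_t jt; uint8_t jf; uint32_t k; };

/* tcpdump -dd 'tcp[tcpflags] & (tcp-syn) != 0', copied from the example source. */
static const struct demo_sock_filter demo_syn_filter[] = {
    { 0x28, 0, 0, 0x0000000c }, { 0x15, 0, 8, 0x00000800 },
    { 0x30, 0, 0, 0x00000017 }, { 0x15, 0, 6, 0x00000006 },
    { 0x28, 0, 0, 0x00000014 }, { 0x45, 4, 0, 0x00001fff },
    { 0xb1, 0, 0, 0x0000000e }, { 0x50, 0, 0, 0x0000001b },
    { 0x45, 0, 1, 0x00000002 }, { 0x06, 0, 0, 0x0000ffff },
    { 0x06, 0, 0, 0x00000000 },
};
#define DEMO_SYN_FILTER_LEN (sizeof(demo_syn_filter) / sizeof(demo_syn_filter[0]))

static uint32_t demo_cbpf_run(const struct demo_sock_filter *prog, size_t n,
                              const uint8_t *pkt, size_t len)
{
    uint32_t a = 0, x = 0; /* accumulator and index register */
    size_t pc = 0;

    while (pc < n) {
        const struct demo_sock_filter *f = &prog[pc++];
        size_t at;

        switch (f->code) {
        case 0x28: /* ldh [k]: load a big-endian half word */
            if ((size_t)f->k + 2 > len) return 0;
            a = ((uint32_t)pkt[f->k] << 8) | pkt[f->k + 1];
            break;
        case 0x30: /* ldb [k]: load a byte */
            if (f->k >= len) return 0;
            a = pkt[f->k];
            break;
        case 0x50: /* ldb [x + k]: load a byte at an indexed offset */
            at = (size_t)x + f->k;
            if (at >= len) return 0;
            a = pkt[at];
            break;
        case 0xb1: /* ldxb 4*([k]&0xf): IPv4 header length into x */
            if (f->k >= len) return 0;
            x = 4u * (pkt[f->k] & 0x0fu);
            break;
        case 0x15: /* jeq #k, jt, jf */
            pc += (a == f->k) ? f->jt : f->jf;
            break;
        case 0x45: /* jset #k, jt, jf */
            pc += (a & f->k) ? f->jt : f->jf;
            break;
        case 0x06: /* ret #k */
            return f->k;
        default:   /* opcode not covered by this sketch */
            return 0;
        }
    }
    return 0;
}

/* Builds a minimal 54-byte Ethernet + IPv4 + TCP frame with the given TCP flags. */
static void demo_make_tcp_frame(uint8_t frame[54], uint8_t tcp_flags)
{
    for (size_t i = 0; i < 54; i++) frame[i] = 0;
    frame[12] = 0x08; frame[13] = 0x00; /* EtherType: IPv4 */
    frame[14] = 0x45;                   /* IPv4, header length 5 * 4 = 20 bytes */
    frame[23] = 6;                      /* IP protocol: TCP */
    frame[47] = tcp_flags;              /* TCP flags byte at 14 + 20 + 13 */
}
```

Running demo_cbpf_run() on a frame built with the SYN flag (0x02) returns 0xffff (accept), while a frame with only the ACK flag (0x10) returns 0 (drop).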

cBPF example 1) A filter that passes only ICMPv4 reply packets with a specific ident value (this one includes skipping the IPv4 header)

            struct sock_filter s_bpf_code[] = { /* drop if the ident differs or the packet is not ICMP_ECHOREPLY */
                BPF_STMT(BPF_LDX | BPF_B   | BPF_MSH, 0),          /* Skip IP header due BSD, see ping6. */
                BPF_STMT(BPF_LD  | BPF_H   | BPF_IND, 4),          /* Load icmp echo ident */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xAAAA, 1, 0), /* Ours? */
                BPF_STMT(BPF_RET | BPF_K, 0),                      /* Echo with wrong ident. Reject. */
                BPF_STMT(BPF_LD  | BPF_B   | BPF_IND, 0),          /* Load icmp type */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ICMP_ECHOREPLY, 0, 1), /* Echo? */
                BPF_STMT(BPF_RET | BPF_K, ~0U),                    /* Yes, it passes. */
                BPF_STMT(BPF_RET | BPF_K, 0)                       /* Echo with wrong reply. Reject. */
            };

cBPF example 2) A filter that passes only ICMPv6 reply packets with a specific ident value

            struct sock_filter s_bpf_code[] = { /* drop if the ident differs or the packet is not ICMP6_ECHO_REPLY */
                BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 4),          /* Load icmp echo ident */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xAAAA, 1, 0), /* Ours? */
                BPF_STMT(BPF_RET | BPF_K, 0),                      /* Echo with wrong ident. Reject. */
                BPF_STMT(BPF_LD  | BPF_B   | BPF_ABS, 0),          /* Load icmp type */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ICMP6_ECHO_REPLY, 0, 1), /* Echo? */
                BPF_STMT(BPF_RET | BPF_K, ~0U),                    /* Yes, it passes. */
                BPF_STMT(BPF_RET | BPF_K, 0)                       /* Echo with wrong reply. Reject. */
            };

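LLVM : (formerly short for Low Level Virtual Machine; today LLVM is simply the name itself) is a compiler infrastructure. It generates code for a virtual machine from the source language, and language-independent optimizations are run on that virtual machine code.
Clang : a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages. It uses LLVM as its back end.
- Most programs written in C whose build honors the CC environment variable can be built with clang as follows.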
```
make CC=clang
```

BCC : BPF Compiler Collection

BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.

kprobes : Kernel dynamic instrumentation
uprobes : User-level dynamic instrumentation

Linux kernel porting guide
DPDK(Data Plane Development Kit)
VPP (FD.io) notes (a Korean-language reference for FD.io VPP developers)
OSI 7-layer model
Socket filter
VPN (Virtual Private Network)
Computing the Internet Checksum (RFC1071)
About skbuff (socket buffer descriptors) in the Linux kernel
How to use iptables
About Netlink sockets
XDP Project
- XDP Hands-On Tutorial
- xdp-tools - Library and utilities for use with XDP
  - Issue when building on a RaspberryPi - "/usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not found"
- XDP Paper
  - The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel
AF_XDP documentation on kernel.org
eBPF Instruction Set
VPP(Vector Packet Processing) - FD.io
Packet MMAP documentation on kernel.org
eBPF and XDP walkthrough and recent updates at FOSDEM 2017 by Daniel Borkmann, Cilium
Fast Packet Processing in Linux with AF_XDP at FOSDEM 2018 by Magnus Karlsson, Intel
eBPF.io - Introduction, Tutorials & Community Resources
L4Drop: XDP DDoS Mitigations, Cloudflare
Unimog: Cloudflare's edge load balancer, Cloudflare
Open-sourcing Katran, a scalable network load balancer, Facebook
Cilium's L4LB: standalone XDP load balancer, Cilium
Kube-proxy replacement at the XDP layer, Cilium
Cilium Documentation - cilium.pdf
eCHO Podcast on XDP and load balancing
BPF and XDP Reference Guide
- https://github.com/ebpf-io/ebpf.io
The Path to DPDK Speeds for AF XDP
XDP ACCELERATION USING NIC META DATA - Intel
Low-Latency, Deterministic Networking with Standard Linux using XDP Sockets
Integrating AF_XDP into DPDK
XDP (eXpress Data Path) as a building block for other FOSS projects
Express Data Path From Wikipedia
LLVM From Wikipedia
Clang From Wikipedia
What is Linux eBPF (Extended Berkeley Packet Filter)?
A thorough introduction to eBPF
XDP IO Visor Project
A high-performance, highly available NAT system based on eBPF/XDP on K8s
Capturing network traffic in an eXpress Data Path (XDP) environment - RED HAT BLOG
BCC - Dynamic tracing tools for Linux performance monitoring, networking, and more
Express_Data_Path.pdf From iovisor
LWN - Implementing eBPF for Windows
LWN - Accelerating networking with AF_XDP
LWN - The BPF system call API, version 14
LWN - Zero-copy network transmission with io_uring
LWN - BPF: sockmap and sk redirect support
BPF-Based Linux Firewall "bpfilter" Shows Impressive Performance Potential
BPF / XDP August seminar, KossLab
Kernel bypass From Cloudflare
DPDK (Data Plane Development Kit) Project
- Programmer’s Guide
  - 58. IPsec Packet Processing Library
- DPDK PMD for AF_XDP
mTCP
Snabb: Simple and fast packet networking
netmap - the fast packet I/O framework
How to drop 10 million packets per second From Cloudflare
XDP: A SIMPLE LIBRARY FOR TEACHING A DISTRIBUTED PROGRAMMING MODULE
Extended BPF - 네트워크 언저리
The BSD Packet Filter: A New Architecture for User-level Packet Capture
Zero-Copy BPF
Documentation/networking/packet mmap.txt
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
Unofficial eBPF spec
tc-bpf(8) man page
Tech column - Using eBPF + BCC for Linux game server performance analysis
uBPF (Userspace eBPF VM)
Improving cloud networking performance with eBPF - NETLOX (Loxilight)
Extending the matching abilities of OpenFlow
bpf(2) man page
The cornerstone of new Linux network technology -- ebpf and XDP
eBPF Summit 2020 On Demand
- Day1
- Day2
XDP Forwarding
Firewalling with BPF/XDP: Examples and Deep Dive
Awesome eBPF - A curated list of awesome projects related to eBPF
Building a peering router based on XDP (eXpress Data Path)
LLVM Documentation
- The LLVM Target-Independent Code Generator
AF_XDPアプリケーション性能特性の定性的評価〜レイテンシ編
Linux tc and eBPF
Full introduction EBPF-Concept
Introduction to eBPF and XDP - (mcorbin.fr)
XDP program ip link error: Prog section rejected: Operation not permitted
tc/BPF and XDP/BPF - Hangbin Liu

https://legacy.netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf : XDP meta-data Acceleration - Saeed Mahameed

struct xdp_buff {
...
void *data_meta;
...
};

xdp_set_data_meta_invalid(&xdp);
int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta);
int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta);

driver on XDP RX packet :
    xdp_buff->data_meta = xdp_buff->data - sizeof(meta_data);
    *(struct meta_data *)xdp_buff->data_meta = meta_data;
XDP user program:
    meta_data = (struct meta_data *)xdp_buff->data_meta;
...
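
The pointer arithmetic in the notes above can be tried out in user space with a mock. The sketch below is only an illustration under stated assumptions: struct mock_xdp_buff is a simplified stand-in for the kernel's struct xdp_buff, struct meta_data and its fields are a hypothetical layout, and mock_adjust_meta only loosely imitates bpf_xdp_adjust_meta (it does not reproduce the kernel's alignment or size checks). In this mock, data_meta == data means that no metadata is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified user-space stand-in for struct xdp_buff (illustration only). */
struct mock_xdp_buff {
    uint8_t *data_hard_start;  /* start of the buffer; headroom begins here */
    uint8_t *data_meta;        /* start of metadata; equals data when empty */
    uint8_t *data;             /* first packet byte */
    uint8_t *data_end;         /* one past the last packet byte */
};

/* Hypothetical metadata layout agreed on by "driver" and "program". */
struct meta_data {
    uint32_t rx_hash;
    uint32_t vlan_tci;
};

/* Move data_meta by delta bytes; a negative delta grows the metadata area
 * into the headroom. Returns 0 on success, -1 if the move is out of range. */
static int mock_adjust_meta(struct mock_xdp_buff *xdp, int delta)
{
    ptrdiff_t headroom = xdp->data_meta - xdp->data_hard_start;
    ptrdiff_t metalen  = xdp->data - xdp->data_meta;

    if (delta < 0 && (ptrdiff_t)(-(long)delta) > headroom)
        return -1;
    if (delta > 0 && (ptrdiff_t)delta > metalen)
        return -1;
    xdp->data_meta += delta;
    return 0;
}

/* "Driver" side: reserve room right before the packet and store the metadata. */
static int driver_store_meta(struct mock_xdp_buff *xdp, const struct meta_data *md)
{
    if (mock_adjust_meta(xdp, -(int)sizeof(*md)) != 0)
        return -1;
    memcpy(xdp->data_meta, md, sizeof(*md));
    return 0;
}

/* "Program" side: bounds-check against data before reading the metadata. */
static int prog_load_meta(const struct mock_xdp_buff *xdp, struct meta_data *out)
{
    if ((size_t)(xdp->data - xdp->data_meta) < sizeof(*out))
        return -1;
    memcpy(out, xdp->data_meta, sizeof(*out));
    return 0;
}
```

After driver_store_meta() succeeds, data_meta sits exactly sizeof(struct meta_data) bytes before data, which is the relationship the driver-side line in the notes expresses.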

AF-XDP: How do I get ctx->data_meta from kernel into user-space? - Stackoverflow
BPF Features by Linux Kernel Version

libxdp - Man Page

Kernel and BPF program feature compatibility

To get the full benefit of all features, libxdp needs to be used with kernel 5.10 or newer, unless the commits mentioned below have been backported.
...

libbpf documentation - readthedocs.io
Bringing TSO/GRO and Jumbo frames to XDP
LWN - More flexible memory access for BPF programs
Re: PATCHSET v6 sched: Implement BPF extensible scheduler class
Reference videos
XDP (eXpress Data Path) as a building block for other FOSS projects
Reference video

[Youtube movie]

Download xdp_building_block.pdf

Kernel-bypass techniques for high-speed network packet processing
Reference video

[Youtube movie]

BPF Internals (eBPF)
Reference video

[Youtube movie]

Slideshare: BPF Internals (eBPF) by Brendan Gregg

Kernel-bypass networking for fun and profit
Reference video

[Youtube movie]

Linux Networking - eBPF, XDP, DPDK, VPP - What does all that mean? (by Andree Toonk)
Reference video

[Youtube movie]

A Beginner's Guide to eBPF Programming for Networking - Liz Rice, Isovalent
Reference video

[Youtube movie]

13 Poll Mode Driver for XDP Zero Copy Sivaprasad Tummala, Intel India
Reference video

[Youtube movie]

DPDK PMD for AF_XDP
Reference video

[Youtube movie]

----

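[1] A map is a key-value store (KVS) through which eBPF programs and user programs share information.
[2] A scratch register is a register whose contents may be clobbered (changed) within a function or context. This implies that the caller must treat it as a register it has to save and restore itself.
[3] When several CPU cores or threads may modify the same memory, an operation that completes its sequence of modifications safely, without interference from the others, is called an atomic operation. It is usually implemented by directly using instructions that the CPU provides, in most cases by locking the memory bus. On x86, the lock prefix can be used in assembly code.
[4] A sandbox is a form of security in which a program that comes from outside runs in a protected area, which prevents it from tampering with the system.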