eBPF

1 Terminology

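  • Static instrumentation: ()
  • User-level statically defined tracing: (USDT)
  • Dynamic instrumentation: (kprobe, uprobe)
  • BPF bytecode: the BPF virtual instruction set
  • JIT (just-in-time) compiler: compiles BPF bytecode into native instructions, which is more efficient than interpreting it. JIT-compiled code runs directly on the processor, just like any other kernel function.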

2 BPF and eBPF

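1. The extended BPF was originally called eBPF, but today "BPF" simply means eBPF. The BPF implementation in the kernel supports both classic BPF and extended BPF (eBPF).
2. BPF originally stood for Berkeley Packet Filter, but it is no longer limited to filtering network packets. It is better understood as a technology in its own right and is no longer treated as an acronym for Berkeley Packet Filter.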

Comparison of BPF and eBPF

3 BPF

BPF is often described as a virtual machine, but JIT-compiled BPF code runs directly on the processor, just like any other kernel function.

3.1 Advantages of BPF

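  1. BPF filters and processes data directly in kernel space, whereas traditional tools copy the data to user space first. Avoiding that copy gives BPF better efficiency and performance.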

3.2 Limitations of BPF

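  1. BPF programs cannot call arbitrary kernel functions; they can only call the BPF helper functions defined in the API. More helpers are added to the API in later kernel versions as the need arises. Loops are also restricted: allowing a BPF program to insert an infinite loop at a kprobe would be unsafe, because the interrupted thread may hold critical locks and the whole system could deadlock. Workarounds include loop unrolling and adding dedicated helper functions for common cases that need loops. Linux 5.3 added support for bounded loops, whose upper bound can be checked by the verifier.
  2. The BPF stack cannot exceed MAX_BPF_STACK, which is 512 bytes. This limit comes up when writing BPF observability tools, especially when several character buffers are placed on the stack: a single char[256] buffer already consumes half of it. There is currently no plan to raise the limit. The workaround is to use BPF map storage instead, although map storage has size limits of its own. The bpftrace project has already started moving string storage from the stack into maps.
  3. The total number of instructions in a BPF program was originally limited to 4096. Long BPF programs sometimes hit this limit (and would hit it sooner without LLVM's compiler optimizations). Linux 5.2 raised the limit so much that it is no longer a practical concern. The verifier's job is to accept every safe program, so the instruction count should not be the obstacle.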
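
The stack budget in item 2 can be checked with simple arithmetic. The following is a minimal sketch in Python; the helper name `fits_on_bpf_stack` is hypothetical, and it only models the 512-byte figure quoted above. It ignores alignment and any other local variables, and it does not talk to the real verifier.

```python
MAX_BPF_STACK = 512  # bytes, the stack limit quoted in the list above


def fits_on_bpf_stack(buffer_sizes):
    """Return (fits, bytes_used) for a list of on-stack buffer sizes in bytes.

    Hypothetical helper for illustration only; alignment and other locals are ignored.
    """
    used = sum(buffer_sizes)
    return used <= MAX_BPF_STACK, used


print(fits_on_bpf_stack([256]))          # one char[256] uses half of the stack
print(fits_on_bpf_stack([256, 256]))     # two such buffers use all of it
print(fits_on_bpf_stack([256, 256, 8]))  # anything beyond that no longer fits
```

In a real tool, a buffer that does not fit would be moved into a BPF map, as the text describes.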

3.3 BPF architecture

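The Linux BPF runtime is organized as follows: BPF instructions are first checked by the BPF verifier and then executed by the BPF virtual machine. The virtual machine implementation includes both an interpreter and a JIT compiler; the JIT compiler generates machine instructions that the processor executes directly. The verifier rejects unsafe operations, which includes checking for unbounded loops: a BPF program must finish in bounded time. BPF programs can use helper functions to read kernel state and BPF maps for storage. They run when specific events fire, including kprobes, uprobes, and tracepoints.
Figure: BPF internals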

4 bcc

BCC - BPF-based tools for Linux I/O analysis, networking, monitoring, and more

4.1 Installing bcc

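  • Installation steps for each Linux distribution. Installing with yum caused a variety of problems in my own tests, so the steps below build from source.
  • After installation, the tools are placed in /usr/share/bcc/tools. They are all written in Python.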
[root@centos7 tools]# cd /usr/share/bcc/tools
[root@centos7 tools]# ls
argdist       btrfsdist    cpuunclaimed  doc         filetop         javaflow     llcstat         nodegc       perlflow   pythoncalls  rubygc      slabratetop  syncsnoop  tcpconnect  tcptracer   xfsdist
bashreadline  btrfsslower  dbslower      drsnoop     funccount       javagc       mdflush         nodestat     perlstat   pythonflow   rubyobjnew  sofdsnoop    syscount   tcpconnlat  tplist      xfsslower
biolatency    cachestat    dbstat        execsnoop   funclatency     javaobjnew   memleak         offcputime   phpcalls   pythongc     rubystat    softirqs     tclcalls   tcpdrop     trace
biosnoop      cachetop     dcsnoop       ext4dist    funcslower      javastat     mountsnoop      offwaketime  phpflow    pythonstat   runqlat     solisten     tclflow    tcplife     ttysnoop
biotop        capable      dcstat        ext4slower  gethostlatency  javathreads  mysqld_qslower  oomkill      phpstat    reset-trace  runqlen     sslsniff     tclobjnew  tcpretrans  vfscount
bitesize      cobjnew      deadlock      filelife    hardirqs        killsnoop    nfsdist         opensnoop    pidpersec  rubycalls    runqslower  stackcount   tclstat    tcpsubnet   vfsstat
bpflist       cpudist      deadlock.c    fileslower  javacalls       lib          nfsslower       perlcalls    profile    rubyflow     shmsnoop    statsnoop    tcpaccept  tcptop      wakeuptime

4.1.1 Kernel upgrade

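eBPF was first added in Linux 3.15, and more features arrived in Linux 4.1; the bcc project recommends kernel 4.1 or later.
  • See [[Linux运维#1 5 升级系统内核]]. When installing the kernel, install the headers package as well; otherwise the bcc tools later fail with an error saying the kernel headers cannot be found. For example: yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-headers kernel-ml-tools kernel-ml-devel -y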
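
Before building bcc it can help to confirm that the running kernel meets the 4.1 minimum mentioned above. The following is a minimal sketch in Python; the function names are hypothetical, and the release string is only an example of the format found on a stock CentOS 7 system.

```python
import re

MIN_KERNEL = (4, 1)  # minimum kernel version recommended in the text above


def kernel_version(release):
    """Extract (major, minor) from a release string such as '3.10.0-1160.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release, minimum=MIN_KERNEL):
    """Return True when the release is at least the minimum (major, minor) version."""
    return kernel_version(release) >= minimum


print(meets_minimum("3.10.0-1160.el7.x86_64"))  # False: this kernel needs an upgrade
print(meets_minimum("5.19.0-1.el7.elrepo.x86_64"))  # True
```

On a live system the release string can be taken from `platform.release()` or from `uname -r`.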

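4.1.2 Installing gcc

  • See [[Linux软件安装#1 2 gcc]]

4.1.3 Installing cmake

  • See [[Linux软件安装#1 3 cmake]]

4.1.4 Installing LLVM components

  1. llvm-project project page
  2. The release page offers many archives; make sure you download the llvm-project one, for example llvm-project-14.0.6.src.tar.xz
  3. Install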
[root@centos7 ~]# wget https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.6/llvm-project-14.0.6.src.tar.xz
[root@centos7 ~]# tar -xf llvm-project-14.0.6.src.tar.xz
[root@centos7 ~]# cd llvm-project-14.0.6.src
[root@centos7 llvm-project-14.0.6.src]# mkdir build && cd build
[root@centos7 build]# cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_RTTI=ON -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi"  -DLLVM_TARGETS_TO_BUILD="BPF;X86" -G "Unix Makefiles" ../llvm
[root@centos7 build]# make
[root@centos7 build]# make install

4.1.5 Installing bcc dependencies

yum install -y epel-release
yum update -y
yum groupinstall -y "Development tools"
yum install -y elfutils-libelf-devel bison flex ncurses-devel
yum install -y luajit luajit-devel  # for Lua support

4.1.6 Installing bcc

[root@centos7 ~]# wget https://github.com/iovisor/bcc/releases/download/v0.25.0/bcc-src-with-submodule.tar.gz
[root@centos7 ~]# tar -xzf bcc-src-with-submodule.tar.gz
[root@centos7 ~]# cd bcc && mkdir build && cd build
[root@centos7 build]# cmake ..
[root@centos7 build]# make
[root@centos7 build]# make install

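4.1.7 Troubleshooting

4.1.7.1 libbpf/src/bpf.h: No such file or directory

  • Solution: download bcc-src-with-submodule.tar.gz from the bcc GitHub releases page and build from that archive

4.1.7.2 ImportError: No module named bcc
  • Cause: usually both python2 and python3 are installed; make install placed some of the Python dependencies under the python3 path, but the system runs the tools with python2 by default
  • Solution:
    • Option 1: run the tool with python3 explicitly, for example python3 opensnoop
    • Option 2: repoint the python symlink so that it runs python3 instead of python2 (steps below)
# Locate the python binaries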
[root@centos7 tools]# whereis python
python: /usr/bin/python /usr/bin/python2.7 /usr/bin/python3.6 /usr/bin/python3.6m /usr/lib/python2.7 /usr/lib/python3.6 /usr/lib64/python2.7 /usr/lib64/python3.6 /etc/python /usr/include/python2.7 /usr/include/python3.6m /usr/share/man/man1/python.1.gz
[root@centos7 tools]# cd /usr/bin
[root@centos7 bin]# ll |grep python
lrwxrwxrwx.   1 root root         7 Aug  1 03:10 python -> python2
lrwxrwxrwx.   1 root root         9 Aug  1 03:10 python2 -> python2.7
-rwxr-xr-x.   1 root root      7144 Jun 28 23:30 python2.7
lrwxrwxrwx.   1 root root         9 Aug 15 04:17 python3 -> python3.6
-rwxr-xr-x.   2 root root     11328 Nov 17  2020 python3.6
-rwxr-xr-x.   2 root root     11328 Nov 17  2020 python3.6m
# Remove the symlink that points to python2
[root@centos7 bin]# unlink python
# Link python to python3
[root@centos7 bin]# ln -s python3 python
[root@centos7 bin]# ll |grep python
lrwxrwxrwx.   1 root root         7 Aug 16 06:56 python -> python3
lrwxrwxrwx.   1 root root         9 Aug  1 03:10 python2 -> python2.7
-rwxr-xr-x.   1 root root      7144 Jun 28 23:30 python2.7
lrwxrwxrwx.   1 root root         9 Aug 15 04:17 python3 -> python3.6
-rwxr-xr-x.   2 root root     11328 Nov 17  2020 python3.6
-rwxr-xr-x.   2 root root     11328 Nov 17  2020 python3.6m

4.2 Using the bcc tools

4.2.1 funccount

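Questions it answers:
1. Has a given kernel or user-space function been called at all?
2. How many times per second is it called?
funccount supports pattern matching: the * wildcard by default, and regular expressions with -r.
  • Purpose: count calls to functions, tracepoints, and USDT probes that match a pattern.
  • Syntax: funccount [options] eventname
    • Options
      • -d DURATION: total duration of the trace, in seconds
      • -i INTERVAL: print a summary every INTERVAL seconds
      • -p PID: trace this process only
      • -c CPU: trace on this CPU only
      • -r: treat the pattern as a regular expression
    • Pattern types
      • t:(subsystem:event): tracepoint, for example funccount t:kmem:kmalloc
      • u:: USDT probe, for example funccount -p 185 u:node:gc*
      • c:(libc function): uprobe on a libc function, for example funccount c:malloc
      • absolute or relative path of a binary:function symbol: uprobe, for example funccount /root/a.out:_Z15printCallStack2v
      • kernel function: kprobe, for example funccount vfs_open
  • Official examples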
# 1. Count kernel functions beginning with "vfs_", until Ctrl-C is hit:
$ funccount 'vfs_*'
# 2. Count kernel functions beginning with "tcp_send", until Ctrl-C is hit:
$ funccount 'tcp_send*'
# 3. Print kernel functions beginning with "vfs_", every second:
$ funccount -i 1 'vfs_*'
# 4. Print kernel functions beginning with "vfs_", for ten seconds only:
$ funccount -d 10 'vfs_*'
# 5. Match kernel functions beginning with "vfs_", using regular expressions:
$ funccount -r '^vfs_.*'
# 6. Count vfs calls for process ID 181 only:
$ funccount -p 181 'vfs_*'
# 7. Count calls to the sched_fork tracepoint, indicating a fork() performed:
$ funccount t:sched:sched_fork
# 8. Count all GC USDT probes in the Node process:
$ funccount -p 185 u:node:gc*
# 9. Count all malloc() calls in libc:
$ funccount c:malloc
# 10. Count kernel functions beginning with "vfs_" on CPU 1 only:
$ funccount -c 1 'vfs_*'
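
The pattern types listed above differ only in their prefix, so the mapping from a pattern to a probe type can be sketched in a few lines. This is a simplified, hypothetical classifier written for illustration; it is not the parser that funccount itself uses, and it only recognizes the five forms listed in this section.

```python
def classify_probe(spec):
    """Map a funccount-style pattern to the probe type described in the list above.

    Hypothetical helper: only the five documented forms are recognized.
    """
    parts = spec.split(":")
    if parts[0] == "t" and len(parts) == 3:
        return "tracepoint"            # t:subsystem:event
    if parts[0] == "u" and len(parts) == 3:
        return "USDT"                  # u:library:probe
    if parts[0] == "c" and len(parts) == 2:
        return "uprobe (libc)"         # c:function
    if len(parts) == 2 and parts[0]:
        return "uprobe (binary)"       # path:function
    if len(parts) == 1 and parts[0]:
        return "kprobe"                # bare kernel function name
    raise ValueError(f"unsupported pattern: {spec!r}")


for spec in ["t:kmem:kmalloc", "u:node:gc*", "c:malloc",
             "/root/a.out:_Z15printCallStack2v", "vfs_open"]:
    print(f"{spec:40} {classify_probe(spec)}")
```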
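4.2.1.1 Tracing functions in a user program
  • Use nm -s to list the symbols in the binary and pick the function name you need
  • Syntax: funccount <absolute or relative path of the binary>:<function name>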
[root@centos7 ~]# funccount /root/a.out:_Z15printCallStack2v 
Tracing 1 functions for "b'/root/a.out:_Z15printCallStack2v'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
_Z15printCallStack2v                        1
Detaching...
  • Use the * wildcard to match all functions
  • Syntax: funccount <absolute or relative path of the binary>:*
  • Example
[root@centos7 ~]# funccount /root/a.out:*
Tracing 15 functions for "b'./a.out:*'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
_Z5test3i                                   1
_fini                                       1
_Z5test2i                                   1
main                                        1
_Z5test1i                                   1
_init                                       1
_start                                      1
__libc_csu_init                             1
_Z15printCallStack2v                        1
deregister_tm_clones                        1
register_tm_clones                          1
__do_global_dtors_aux                       1
frame_dummy                                 1
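
Names such as _Z15printCallStack2v and _Z5test3i in the output above are mangled C++ symbols. The sketch below decodes only the simplest form of the Itanium C++ ABI scheme (a length-prefixed function name followed by builtin parameter types); `demangle_simple` is a hypothetical helper, and anything it does not understand is returned unchanged. For real work, a full demangler such as c++filt is the better choice.

```python
import re

# A few builtin type codes from the Itanium C++ ABI; everything else is unsupported here.
BUILTIN_TYPES = {"v": "void", "b": "bool", "c": "char", "i": "int",
                 "l": "long", "f": "float", "d": "double"}


def demangle_simple(symbol):
    """Decode _Z<length><name><builtin type codes>; return other symbols unchanged."""
    match = re.match(r"_Z(\d+)", symbol)
    if match is None:
        return symbol
    length = int(match.group(1))
    start = match.end()
    name = symbol[start:start + length]
    codes = symbol[start + length:]
    if len(name) != length or not codes:
        return symbol
    if any(code not in BUILTIN_TYPES for code in codes):
        return symbol
    if codes == "v":  # a lone 'v' means an empty parameter list
        return f"{name}()"
    return f"{name}({', '.join(BUILTIN_TYPES[code] for code in codes)})"


for symbol in ["_Z15printCallStack2v", "_Z5test3i", "main"]:
    print(symbol, "->", demangle_simple(symbol))
```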
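4.2.1.2 Tracing libc functions
The c: prefix is only a shortcut for libc; the generic library path:function form used for user programs above works as well
  • Syntax: funccount c:<libc function>
  • Example: trace the open function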
[root@centos7 ~]# funccount c:open
Tracing 1 functions for "b'c:open'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
open                                      247
Detaching...
  • Example: pattern matching with c:open*
Tracing 9 functions for "b'c:open*'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
opendir                                     1
open64                                     12
Detaching...
  • Example: the generic library path:function form
[root@centos7 perf_example]# funccount /lib64/libc.so.6:opendir
Tracing 1 functions for "b'/lib64/libc.so.6:opendir'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
opendir                                     2
Detaching...
4.2.1.3 Tracing kernel functions
  • Syntax: funccount <kernel function>
  • Example: trace a specific kernel function
[root@centos7 perf_example]# funccount  vfs_open
Tracing 1 functions for "b'vfs_open'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
vfs_open                                    6
Detaching...
  • Example: wildcard pattern
[root@centos7 perf_example]# funccount  vfs_*
Tracing 71 functions for "b'vfs_*'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
vfs_statx                                   1
vfs_fstatat                                 1
vfs_fsync_range                             3
vfs_statfs                                  9
vfs_write                                  37
vfs_read                                   82
Detaching...
4.2.1.4 Tracing tracepoints
  • Syntax: funccount t:<tracepoint>
[root@centos7 events]# funccount t:vsyscall:emulate_vsyscall
Tracing 1 functions for "b't:vsyscall:emulate_vsyscall'"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
Detaching...
4.2.1.5 Tracing USDT probes
  • Syntax: funccount u:<USDT probe>*

4.2.2 stackcount

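Questions it answers:
1. Why is an event being triggered? Which code path leads to it?
2. Which distinct code paths trigger the event, and how often does each occur?
  • Purpose: count the call stacks that lead to an event.
  • Syntax: stackcount [options] eventname
    • Options
      • -s: show address offsets for stack frames
      • -v: show raw addresses for stack frames
      • -d: insert a -- delimiter between the kernel and user stacks
      • -U: show user stacks only
      • -K: show kernel stacks only
      • -T: include a timestamp
    • Pattern types, the same as for funccount
      • t:(subsystem:event): tracepoint, for example stackcount t:kmem:kmalloc
      • u:: USDT probe, for example stackcount -p 185 u:node:gc*
      • c:(libc function): uprobe on a libc function, for example stackcount c:malloc
      • absolute or relative path of a binary:function symbol: uprobe, for example stackcount /root/a.out:_Z15printCallStack2v
      • kernel function: kprobe, for example stackcount vfs_open
  • Example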
[root@centos7 ~]# stackcount  vfs_read
Tracing 1 functions for "vfs_read"... Hit Ctrl-C to end.
^C
  vfs_read
  __x64_sys_read
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    1
[root@centos7 ~]# stackcount -vs vfs_read
Tracing 1 functions for "vfs_read"... Hit Ctrl-C to end.
^C
  ffffffffba3467b1 vfs_read+0x1
  ffffffffba346cea __x64_sys_read+0x1a
  ffffffffbaab014b do_syscall_64+0x3b
  ffffffffbac0009b entry_SYSCALL_64_after_hwframe+0x63
  7f976820e75d     [unknown]
    1
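
The number printed under each stack is how many times that exact call path was seen. The sketch below reproduces only that aggregation step in user-space Python, as an illustration of the idea; the sample stacks are made up, the second code path is hypothetical, and `count_stacks` is not part of bcc.

```python
from collections import Counter


def count_stacks(samples):
    """Count identical call stacks; each sample is a sequence of frames, innermost first."""
    return Counter(tuple(stack) for stack in samples)


# Made-up samples for illustration only.
samples = [
    ["vfs_read", "__x64_sys_read", "do_syscall_64"],
    ["vfs_read", "__x64_sys_read", "do_syscall_64"],
    ["vfs_read", "kernel_read"],  # hypothetical second code path
]

for stack, hits in count_stacks(samples).most_common():
    print("\n".join("  " + frame for frame in stack))
    print(f"    {hits}\n")
```

Grouping by the whole stack rather than by the top frame is what lets the tool separate different code paths that end in the same function.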

4.2.3 trace

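Questions it answers:
1. When a kernel or user-space function is called, what are its arguments?
2. What does the function return? Did the call fail?
3. How was the function called? What is the corresponding user or kernel call stack?
  • Options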
  -h, --help            show this help message and exit
  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
                        number of pages to use for perf_events ring buffer
                        (default: 64)
  -p PID, --pid PID     id of the process to trace (optional)
  -L TID, --tid TID     id of the thread to trace (optional)
  --uid UID             id of the user to trace (optional)
  -v, --verbose         print resulting BPF program code before executing
  -Z STRING_SIZE, --string-size STRING_SIZE
                        maximum size to read from strings
  -S, --include-self    do not filter trace's own pid from the trace
  -M MAX_EVENTS, --max-events MAX_EVENTS
                        number of events to print before quitting
  -t, --timestamp       print timestamp column (offset from trace start)
  -u, --unix-timestamp  print UNIX timestamp instead of offset from trace
                        start, requires -t
  -T, --time            print time column
  -C, --print_cpu       print CPU id
  -c CGROUP_PATH, --cgroup-path CGROUP_PATH
                        cgroup path
  -n NAME, --name NAME  only print process names containing this name
  -f MSG_FILTER, --msg-filter MSG_FILTER
                        only print the msg of event containing this string
  -B, --bin_cmp         allow to use STRCMP with binary values
  -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
                        comma separated list of symbol files to use for symbol
                        resolution
  -K, --kernel-stack    output kernel stack trace
  -U, --user-stack      output user stack trace
  -a, --address         print virtual address in stacks
  -I header, --include header
                        additional header files to include in the BPF program
                        as either full path, or relative to current working
                        directory, or relative to default kernel header search
                        path
  -A, --aggregate       aggregate amount of each trace
  • Official examples:
trace do_sys_open
        Trace the open syscall and print a default trace message when entered
trace kfree_skb+0x12
        Trace the kfree_skb kernel function after the instruction on the 0x12 offset
trace 'do_sys_open "%s", arg2@user'
        Trace the open syscall and print the filename being opened @user is
        added to arg2 in kprobes to ensure that char * should be copied from
        the userspace stack to the bpf stack. If not specified, previous
        behaviour is expected.

trace 'do_sys_open "%s", arg2@user' -n main
        Trace the open syscall and only print event that process names containing "main"
trace 'do_sys_open "%s", arg2@user' --uid 1001
        Trace the open syscall and only print event that processes with user ID 1001
trace 'do_sys_open "%s", arg2@user' -f config
        Trace the open syscall and print the filename being opened filtered by "config"
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
        Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_open "%llx", retval'
        Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
        Trace the open() call from libc only if the flags (arg2) argument is 42
trace 'c:malloc "size = %d", arg1'
        Trace malloc calls and print the size being allocated
trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
        Trace the write() call from libc to monitor writes to STDOUT
trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
        Trace returns from __kmalloc which returned a null pointer
trace 'r:c:malloc (retval) "allocated = %x", retval'
        Trace returns from malloc and print non-NULL allocated buffers
trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
        Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'u:pthread:libpthread:pthread_create (arg4 != 0)'
        Ditto, but the provider name "libpthread" is specified.
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
        Trace the nanosleep syscall and print the sleep duration in ns
trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone'
        Trace nanosleep/clone syscall calls only under workload.service
        cgroup hierarchy.
trace -I 'linux/fs.h' \
      'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
        Trace the uprobe_register inode mapping ops, and the symbol can be found
        in /proc/kallsyms
trace -I 'kernel/sched/sched.h' \
      'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
        Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
        in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
        package.  So this command needs to run at the kernel source tree root directory
        so that the added header file can be found by the compiler.
trace -I 'net/sock.h' \
      'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
        Trace udpv6 sendmsg calls only if socket's destination port is equal
        to 53 (DNS; 13568 in big endian order)
trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
        Trace the number of users accessing the file system of the current task
trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
        Trace inet_pton system call and use the specified libraries/executables for
        symbol resolution.
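
The udpv6_sendmsg example above compares sk_dport with 13568 rather than 53 because the port is stored in network (big-endian) byte order, while the filter compares it as a native integer on a little-endian CPU. A minimal Python sketch of that conversion, assuming a little-endian host such as x86_64 (the function name is hypothetical):

```python
import struct


def port_as_seen_on_little_endian(port):
    """Return the integer a little-endian CPU reads from a 16-bit port stored in network byte order."""
    network_bytes = struct.pack(">H", port)         # big-endian, as stored in the socket
    return struct.unpack("<H", network_bytes)[0]    # reinterpret the same bytes as little-endian


print(port_as_seen_on_little_endian(53))  # 13568
```

The same function applied to 13568 gives back 53, which is a quick way to decode a port value seen in trace output.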

5 References