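eBPF
1 Terminology
- Static instrumentation: (tracepoint)
- User-level static tracing instrumentation: USDT (user-level statically defined tracing)
- Dynamic instrumentation: (kprobe, uprobe)
- BPF bytecode: the BPF virtual instruction set
- JIT (just-in-time) compiler: compiles BPF bytecode into native instructions; this is faster than the interpreter, and JIT-compiled code runs directly on the processor like any other kernel function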
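2 BPF and eBPF
1. Extended BPF was originally called eBPF, but today "BPF" simply means eBPF; the BPF implementation in the kernel supports both classic BPF and extended BPF.
2. BPF originally stood for Berkeley Packet Filter, but it is no longer limited to filtering network packets. It is better regarded as a technology in its own right than as an acronym.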
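3 BPF
BPF is often called a virtual machine, but JIT-compiled code runs directly on the processor like any other kernel function.
3.1 Advantages of BPF
- BPF filters and processes data directly in kernel space, whereas traditional tools first copy the data to user space. Avoiding that copy gives BPF better efficiency and performance.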
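3.2 Limitations of BPF
- BPF programs cannot call arbitrary kernel functions; they can only call the BPF helper functions defined in the API. More helpers are added to the API in later versions as needs arise. BPF programs are also restricted in how they loop: letting BPF insert an infinite loop into kprobes would be unsafe, because those threads may be holding critical locks, which could deadlock the whole system. Workarounds include unrolling loops and adding dedicated helper functions for common cases that need a loop. Linux 5.3 supports bounded loops, whose upper bound can be checked by the verifier.
- The BPF stack cannot exceed MAX_BPF_STACK, which is 512 bytes. You run into this limit when writing BPF observability tools, especially when placing several character buffers on the stack: a single char[256] buffer consumes half of it. There are currently no plans to raise the limit. The workaround is to use BPF map storage instead (maps have size limits of their own). In the bpftrace project, work has already started on moving string storage from the stack into maps.
- The total number of instructions in a BPF program was originally limited to 4096. Long BPF programs sometimes hit this limit (and would hit it sooner without LLVM's compiler optimizations). Linux 5.2 raised the limit substantially (to one million instructions for privileged programs), so it should no longer be a concern. The verifier's job is to accept every safe program, and the instruction limit should not get in the way.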
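The stack limit described above comes down to simple arithmetic. The following is a minimal Python sketch, not bcc or kernel code: the constant mirrors the 512-byte limit quoted above, while `stack_budget` and the buffer names are hypothetical and exist only for illustration. It also ignores alignment and the registers the compiler spills to the stack, so real usage tends to be somewhat higher.

```python
# Illustrative arithmetic only; this is not part of bcc or the kernel.
MAX_BPF_STACK = 512  # bytes, the stack limit described above


def stack_budget(buffers):
    """Return (used, remaining, fits) for a dict of {name: size_in_bytes}."""
    used = sum(buffers.values())
    remaining = MAX_BPF_STACK - used
    return used, remaining, used <= MAX_BPF_STACK


# One char[256] buffer consumes half of the stack.
print(stack_budget({"filename": 256}))
# Two 256-byte buffers plus a small struct no longer fit.
print(stack_budget({"src_path": 256, "dst_path": 256, "event": 16}))
```

When the last value is False, the usual move is the one mentioned above: keep the large buffers in a BPF map rather than on the stack.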
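3.3 BPF architecture
In the Linux BPF runtime, BPF instructions are first checked by the BPF verifier and then executed by the BPF virtual machine. The virtual machine implementation includes both an interpreter and a JIT compiler; the JIT compiler generates machine instructions that the processor executes directly. The verifier rejects unsafe operations, which includes checking for unbounded loops: a BPF program must finish in bounded time. BPF programs can use helper functions to read kernel state and BPF maps for storage. They run when specific events fire, including kprobes, uprobes, and tracepoints.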
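4 bcc
BCC: BPF-based tools for Linux I/O analysis, networking, monitoring, and more
4.1 Installing bcc
- Installation steps for each Linux distribution. In my own testing, installing with yum ran into various problems, so these notes build from source
- After installation, the tools are placed in the /usr/share/bcc/tools directory; they are all written in Python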
[root@centos7 tools]# cd /usr/share/bcc/tools
[root@centos7 tools]# ls
argdist btrfsdist cpuunclaimed doc filetop javaflow llcstat nodegc perlflow pythoncalls rubygc slabratetop syncsnoop tcpconnect tcptracer xfsdist
bashreadline btrfsslower dbslower drsnoop funccount javagc mdflush nodestat perlstat pythonflow rubyobjnew sofdsnoop syscount tcpconnlat tplist xfsslower
biolatency cachestat dbstat execsnoop funclatency javaobjnew memleak offcputime phpcalls pythongc rubystat softirqs tclcalls tcpdrop trace
biosnoop cachetop dcsnoop ext4dist funcslower javastat mountsnoop offwaketime phpflow pythonstat runqlat solisten tclflow tcplife ttysnoop
biotop capable dcstat ext4slower gethostlatency javathreads mysqld_qslower oomkill phpstat reset-trace runqlen sslsniff tclobjnew tcpretrans vfscount
bitesize cobjnew deadlock filelife hardirqs killsnoop nfsdist opensnoop pidpersec rubycalls runqslower stackcount tclstat tcpsubnet vfsstat
bpflist cpudist deadlock.c fileslower javacalls lib nfsslower perlcalls profile rubyflow shmsnoop statsnoop tcpaccept tcptop wakeuptime
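4.1.1 Kernel upgrade
eBPF was added in Linux kernel 3.15, and more features arrived in Linux 4.1. A kernel of version 4.1 or later is officially recommended.
- See [[Linux运维#1 5 升级系统内核]]. When installing the kernel, install the headers package as well; otherwise the bcc tools later fail with an error about missing kernel headers. For example: yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-headers kernel-ml-tools kernel-ml-devel -y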
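Before building bcc it can help to confirm that the running kernel meets the 4.1 recommendation above. The following is a minimal Python sketch; the helper names are hypothetical and the release strings are example values of the kind `uname -r` prints, not output captured for these notes.

```python
import re


def parse_kernel_version(release):
    """Extract (major, minor) from a release string such as '3.10.0-1160.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_recent_enough(release, minimum=(4, 1)):
    """True when the kernel is at least the recommended version (4.1 by default)."""
    return parse_kernel_version(release) >= minimum


# Example release strings (assumed values for illustration).
print(kernel_is_recent_enough("3.10.0-1160.el7.x86_64"))
print(kernel_is_recent_enough("5.19.1-1.el7.elrepo.x86_64"))
```

On a live system, pass `platform.release()` from the standard library instead of a literal string.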
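4.1.2 Installing gcc
- See [[Linux软件安装#1 2 gcc]]
4.1.3 Installing cmake
- See [[Linux软件安装#1 3 cmake]]
4.1.4 Installing the LLVM components
- llvm-project project page
- The project publishes many packages; make sure you download llvm-project itself, for example llvm-project-14.0.6.src.tar.xz
- Installation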
[root@centos7 ~]# wget https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.6/llvm-project-14.0.6.src.tar.xz
[root@centos7 ~]# tar -xf llvm-project-14.0.6.src.tar.xz
[root@centos7 ~]# cd llvm-project-14.0.6.src
[root@centos7 llvm-project-14.0.6.src]# mkdir build && cd build
[root@centos7 build]# cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_RTTI=ON -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi" -DLLVM_TARGETS_TO_BUILD="BPF;X86" -G "Unix Makefiles" ../llvm
[root@centos7 build]# make
[root@centos7 build]# make install
4.1.5 Installing the bcc dependencies
yum install -y epel-release
yum update -y
yum groupinstall -y "Development tools"
yum install -y elfutils-libelf-devel bison flex ncurses-devel
yum install -y luajit luajit-devel # for Lua support
4.1.6 Installing bcc
[root@centos7 ~]# wget https://github.com/iovisor/bcc/releases/download/v0.25.0/bcc-src-with-submodule.tar.gz
[root@centos7 ~]# tar -xzf bcc-src-with-submodule.tar.gz
[root@centos7 ~]# cd bcc && mkdir build && cd build
[root@centos7 build]# cmake ..
[root@centos7 build]# make
[root@centos7 build]# make install
4.1.7 Troubleshooting
4.1.7.1 libbpf/src/bpf.h: No such file or directory
- Fix: download bcc-src-with-submodule.tar.gz from the bcc GitHub releases page and build from that archive
4.1.7.2 ImportError: No module named bcc
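- Cause: usually both python2 and python3 are installed; make install placed the Python bindings under the python3 path, but the system runs the tools with python2 by default
- Fix:
- Method 1: run the tool with python3 explicitly, for example
python3 opensnoop
- Method 2: repoint the python symlink at python3 instead of python2 (steps below). On CentOS 7, yum itself relies on /usr/bin/python being python2, so method 1 is the safer choice
# Locate the python binaries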
[root@centos7 tools]# whereis python
python: /usr/bin/python /usr/bin/python2.7 /usr/bin/python3.6 /usr/bin/python3.6m /usr/lib/python2.7 /usr/lib/python3.6 /usr/lib64/python2.7 /usr/lib64/python3.6 /etc/python /usr/include/python2.7 /usr/include/python3.6m /usr/share/man/man1/python.1.gz
[root@centos7 tools]# cd /usr/bin
[root@centos7 bin]# ll |grep python
lrwxrwxrwx. 1 root root 7 Aug 1 03:10 python -> python2
lrwxrwxrwx. 1 root root 9 Aug 1 03:10 python2 -> python2.7
-rwxr-xr-x. 1 root root 7144 Jun 28 23:30 python2.7
lrwxrwxrwx. 1 root root 9 Aug 15 04:17 python3 -> python3.6
-rwxr-xr-x. 2 root root 11328 Nov 17 2020 python3.6
-rwxr-xr-x. 2 root root 11328 Nov 17 2020 python3.6m
# Remove the symlink that points to python2
[root@centos7 bin]# unlink python
# Link python to python3
[root@centos7 bin]# ln -s python3 python
[root@centos7 bin]# ll |grep python
lrwxrwxrwx. 1 root root 7 Aug 16 06:56 python -> python3
lrwxrwxrwx. 1 root root 9 Aug 1 03:10 python2 -> python2.7
-rwxr-xr-x. 1 root root 7144 Jun 28 23:30 python2.7
lrwxrwxrwx. 1 root root 9 Aug 15 04:17 python3 -> python3.6
-rwxr-xr-x. 2 root root 11328 Nov 17 2020 python3.6
-rwxr-xr-x. 2 root root 11328 Nov 17 2020 python3.6m
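To see which interpreter can actually find the bindings, a quick check like the one below may help. It is a minimal sketch that assumes the bindings are installed as a module named `bcc`; `module_available` is a hypothetical helper, and the script only runs under python3 because it uses `importlib.util`.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the module `name`."""
    return importlib.util.find_spec(name) is not None


print("interpreter:", sys.executable)
print("bcc importable:", module_available("bcc"))
```

If this prints True under python3 while the tools still fail, they are most likely being launched by python2, which matches the cause described above.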
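4.2 Using the bcc tools
4.2.1 funccount
title: What questions does it answer?
1. Has a given kernel or user-space function been called?
2. How many times per second is it called?
funccount supports pattern matching: the * wildcard works by default, and -r switches the pattern to a regular expression.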
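- Purpose: count calls to functions, tracepoints, and USDT probes that match a pattern.
- Syntax: funccount [options] eventname
- Options:
  - -d DURATION: how long to trace, in seconds
  - -i INTERVAL: print output every INTERVAL seconds
  - -p PID: trace only this process
  - -c CPU: trace only on this CPU
  - -r: treat the pattern as a regular expression
- Pattern types:
  - t:<subsystem>:<event> (tracepoint), e.g. funccount t:kmem:kmalloc
  - u: (USDT), e.g. funccount -p 185 u:node:gc*
  - c:<libc function> (uprobe on a library function), e.g. funccount c:malloc
  - <full or relative path to binary>:<function symbol> (uprobe), e.g. funccount /root/a.out:_Z15printCallStack2v
  - <kernel function> (kprobe), e.g. funccount vfs_open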
- Official examples
# 1. Count kernel functions beginning with "vfs_", until Ctrl-C is hit:
$ funccount 'vfs_*'
# 2. Count kernel functions beginning with "tcp_send", until Ctrl-C is hit:
$ funccount 'tcp_send*'
# 3. Print kernel functions beginning with "vfs_", every second:
$ funccount -i 1 'vfs_*'
# 4. Print kernel functions beginning with "vfs_", for ten seconds only:
$ funccount -d 10 'vfs_*'
# 5. Match kernel functions beginning with "vfs_", using regular expressions:
$ funccount -r '^vfs_.*'
# 6. Count vfs calls for process ID 181 only:
$ funccount -p 181 'vfs_*'
# 7. Count calls to the sched_fork tracepoint, indicating a fork() performed:
$ funccount t:sched:sched_fork
# 8. Count all GC USDT probes in the Node process:
$ funccount -p 185 u:node:gc*
# 9. Count all malloc() calls in libc:
$ funccount c:malloc
# 10. Count kernel functions beginning with "vfs_" on CPU 1 only:
$ funccount -c 1 'vfs_*'
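The pattern types listed above can be told apart by their prefix. The following Python sketch classifies an event name the way these notes describe the categories; it is a simplification for illustration, not bcc's own parser, and `classify_event` is a hypothetical helper.

```python
def classify_event(pattern):
    """Guess which probe type a funccount/stackcount pattern refers to.

    Simplified sketch of the categories listed in these notes.
    """
    if pattern.startswith("t:"):
        return "tracepoint"
    if pattern.startswith("u:"):
        return "USDT"
    if pattern.startswith("c:"):
        return "uprobe (libc)"
    if ":" in pattern:
        # Anything else with a colon is treated as <binary>:<function>.
        return "uprobe"
    return "kprobe"


for spec in ["t:kmem:kmalloc", "u:node:gc*", "c:malloc",
             "/root/a.out:_Z15printCallStack2v", "vfs_open"]:
    print(f"{spec:40} {classify_event(spec)}")
```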
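4.2.1.1 Tracing functions in a user program
- Use the nm -s command to list the binary's symbols and pick the function name you need
- Syntax: funccount <full or relative path to binary>:<function name>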
[root@centos7 ~]# funccount /root/a.out:_Z15printCallStack2v
Tracing 1 functions for "b'/root/a.out:_Z15printCallStack2v'"... Hit Ctrl-C to end.
^C
FUNC COUNT
_Z15printCallStack2v 1
Detaching...
- Use the * wildcard to match every function
- Syntax: funccount <full or relative path to binary>:*
- Example
[root@centos7 ~]# funccount './a.out:*'
Tracing 15 functions for "b'./a.out:*'"... Hit Ctrl-C to end.
^C
FUNC COUNT
_Z5test3i 1
_fini 1
_Z5test2i 1
main 1
_Z5test1i 1
_init 1
_start 1
__libc_csu_init 1
_Z15printCallStack2v 1
deregister_tm_clones 1
register_tm_clones 1
__do_global_dtors_aux 1
frame_dummy 1
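Names such as `_Z15printCallStack2v` in the output above are C++ mangled symbols. The following Python sketch decodes only the simplest shape, `_Z<length><name>` followed by a few builtin parameter codes, to show how to read them. It is a deliberately tiny illustration of the Itanium C++ ABI scheme, and `demangle_simple` is a hypothetical helper; for real work, `c++filt` from binutils handles the full grammar.

```python
import re

# Only a few builtin type codes from the Itanium C++ ABI are covered here.
_BUILTIN_TYPES = {"v": "void", "i": "int", "c": "char", "d": "double", "b": "bool"}


def demangle_simple(symbol):
    """Demangle names of the form _Z<len><name><builtin type codes>.

    Returns the symbol unchanged when it does not fit this simple shape.
    """
    match = re.match(r"_Z(\d+)", symbol)
    if match is None:
        return symbol
    length = int(match.group(1))
    start = match.end()
    name = symbol[start:start + length]
    codes = symbol[start + length:]
    if len(name) != length or not codes or any(c not in _BUILTIN_TYPES for c in codes):
        return symbol
    # A lone 'v' means the function takes no parameters.
    params = "" if codes == "v" else ", ".join(_BUILTIN_TYPES[c] for c in codes)
    return f"{name}({params})"


for sym in ["_Z15printCallStack2v", "_Z5test3i", "main"]:
    print(sym, "->", demangle_simple(sym))
```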
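4.2.1.2 Tracing libc functions
The c: prefix is just a shortcut for libc; the generic <binary>:<function> form shown above for user programs works as well.
- Syntax: funccount c:<libc function>
- Example: trace the open function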
[root@centos7 ~]# funccount c:open
Tracing 1 functions for "b'c:open'"... Hit Ctrl-C to end.
^C
FUNC COUNT
open 247
Detaching...
- Example: pattern matching
[root@centos7 ~]# funccount 'c:open*'
Tracing 9 functions for "b'c:open*'"... Hit Ctrl-C to end.
^C
FUNC COUNT
opendir 1
open64 12
Detaching...
- Example: using the generic <binary>:<function> form
[root@centos7 perf_example]# funccount /lib64/libc.so.6:opendir
Tracing 1 functions for "b'/lib64/libc.so.6:opendir'"... Hit Ctrl-C to end.
^C
FUNC COUNT
opendir 2
Detaching...
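When the counts need further processing, the text table that funccount prints can be turned into a dictionary. The following Python sketch assumes the output layout shown in the transcripts above (a FUNC COUNT header followed by name and count pairs); `parse_funccount` is a hypothetical helper, not something bcc ships.

```python
def parse_funccount(output):
    """Turn funccount's 'FUNC COUNT' table into a {function: count} dict."""
    counts = {}
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["FUNC", "COUNT"]:
            in_table = True
            continue
        # Rows have exactly two fields: the function name and an integer count.
        if in_table and len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


sample = """\
Tracing 9 functions for "b'c:open*'"... Hit Ctrl-C to end.
^C
FUNC COUNT
opendir 1
open64 12
Detaching...
"""
print(parse_funccount(sample))
```

With the -i option funccount prints one table per interval; this sketch would then keep only the last count seen for each function.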
4.2.1.3 Tracing kernel functions
- Syntax: funccount <kernel function>
- Example: trace a specific kernel function
[root@centos7 perf_example]# funccount vfs_open
Tracing 1 functions for "b'vfs_open'"... Hit Ctrl-C to end.
^C
FUNC COUNT
vfs_open 6
Detaching...
- Example: wildcard matching (quote the pattern so the shell does not expand it)
[root@centos7 perf_example]# funccount 'vfs_*'
Tracing 71 functions for "b'vfs_*'"... Hit Ctrl-C to end.
^C
FUNC COUNT
vfs_statx 1
vfs_fstatat 1
vfs_fsync_range 3
vfs_statfs 9
vfs_write 37
vfs_read 82
Detaching...
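The relationship between the `*` wildcard and the -r regular expression form can be sketched in a few lines of Python. This assumes the wildcard is handled by rewriting `*` to `.*` and anchoring the pattern, which resembles what funccount does but is written here from scratch; the function names in the sample list are only examples.

```python
import re


def glob_to_regex(pattern):
    """Convert a '*' wildcard pattern into an anchored regular expression."""
    return "^" + pattern.replace("*", ".*") + "$"


def matching_functions(pattern, names, use_regex=False):
    """Return the names selected by `pattern`, mimicking the -r switch."""
    regex = pattern if use_regex else glob_to_regex(pattern)
    return [name for name in names if re.match(regex, name)]


names = ["vfs_read", "vfs_write", "vfs_open", "do_sys_open", "tcp_sendmsg"]
print(glob_to_regex("vfs_*"))
print(matching_functions("vfs_*", names))
print(matching_functions("^vfs_.*", names, use_regex=True))
```

Both calls select the same three vfs_ functions, which is why official examples 1 and 5 count the same set of probes.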
4.2.1.4 Tracing tracepoints
- Syntax: funccount t:<subsystem>:<event>
[root@centos7 events]# funccount t:vsyscall:emulate_vsyscall
Tracing 1 functions for "b't:vsyscall:emulate_vsyscall'"... Hit Ctrl-C to end.
^C
FUNC COUNT
Detaching...
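Tracepoint names follow the directory layout of the kernel's tracing file system, which is where the subsystem and event parts come from. The following Python sketch maps a `t:` pattern to that directory. The mount point is an assumption (some systems expose /sys/kernel/tracing instead), and both helpers are hypothetical.

```python
import os
import posixpath

# Assumed tracefs mount point; some systems expose /sys/kernel/tracing instead.
TRACEFS_EVENTS = "/sys/kernel/debug/tracing/events"


def tracepoint_dir(spec, events_root=TRACEFS_EVENTS):
    """Map 't:<subsystem>:<event>' to the tracefs directory for that tracepoint."""
    prefix, _, rest = spec.partition(":")
    subsystem, _, event = rest.partition(":")
    if prefix != "t" or not subsystem or not event:
        raise ValueError(f"expected t:<subsystem>:<event>, got {spec!r}")
    return posixpath.join(events_root, subsystem, event)


def tracepoint_exists(spec, events_root=TRACEFS_EVENTS):
    """True when the tracepoint directory is present (reading it usually needs root)."""
    return os.path.isdir(tracepoint_dir(spec, events_root))


print(tracepoint_dir("t:vsyscall:emulate_vsyscall"))
```

Listing the events directory is a convenient way to discover which subsystem and event names are available on a given kernel.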
4.2.1.5 Tracing USDT probes
- Syntax: funccount u:<USDT probe>*
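4.2.2 stackcount
title: What questions does it answer?
1. Why is an event being triggered? What code path leads to it?
2. Which different code paths trigger the event, and how often does each one occur?
- Purpose: count the function call stacks that lead to an event.
- Syntax: stackcount [options] eventname
- Options:
  - -s: show address offsets for stack frames
  - -v: show raw addresses for stack frames
  - -d: separate kernel and user-space frames with --
  - -U: show only the user-space stack
  - -K: show only the kernel stack
  - -T: show timestamps
- Pattern types, the same as for funccount:
  - t:<subsystem>:<event> (tracepoint), e.g. stackcount t:kmem:kmalloc
  - u: (USDT), e.g. stackcount -p 185 u:node:gc*
  - c:<libc function> (uprobe on a library function), e.g. stackcount c:malloc
  - <full or relative path to binary>:<function symbol> (uprobe), e.g. stackcount /root/a.out:_Z15printCallStack2v
  - <kernel function> (kprobe), e.g. stackcount vfs_open
- Examples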
[root@centos7 ~]# stackcount vfs_read
Tracing 1 functions for "vfs_read"... Hit Ctrl-C to end.
^C
vfs_read
__x64_sys_read
do_syscall_64
entry_SYSCALL_64_after_hwframe
[unknown]
[unknown]
1
[root@centos7 ~]# stackcount -vs vfs_read
Tracing 1 functions for "vfs_read"... Hit Ctrl-C to end.
^C
ffffffffba3467b1 vfs_read+0x1
ffffffffba346cea __x64_sys_read+0x1a
ffffffffbaab014b do_syscall_64+0x3b
ffffffffbac0009b entry_SYSCALL_64_after_hwframe+0x63
7f976820e75d [unknown]
1
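Stack output like the above is often easier to compare once each stack is collapsed onto a single line. The following Python sketch folds the default text output (frames printed leaf first, then a bare count) into root-to-leaf strings. It assumes the layout shown in the transcripts above, and `fold_stacks` is a hypothetical helper.

```python
def fold_stacks(output):
    """Convert stackcount-style output into {'root;...;leaf': count}."""
    folded = {}
    frames = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line or line == "^C" or line.startswith(("Tracing ", "Detaching")):
            continue
        if line.isdigit():
            # A bare number closes the current stack; frames are printed leaf first.
            key = ";".join(reversed(frames))
            folded[key] = folded.get(key, 0) + int(line)
            frames = []
        else:
            frames.append(line)
    return folded


sample = """\
Tracing 1 functions for "vfs_read"... Hit Ctrl-C to end.
^C
vfs_read
__x64_sys_read
do_syscall_64
entry_SYSCALL_64_after_hwframe
[unknown]
[unknown]
1
"""
for stack, count in fold_stacks(sample).items():
    print(stack, count)
```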
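4.2.3 trace
title: What questions does it answer?
1. When a kernel or user-space function is called, what are its arguments?
2. What does the function return? Did the call fail?
3. How was the function called? What is the corresponding user-space or kernel stack?
- Options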
-h, --help show this help message and exit
-b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
number of pages to use for perf_events ring buffer
(default: 64)
-p PID, --pid PID id of the process to trace (optional)
-L TID, --tid TID id of the thread to trace (optional)
--uid UID id of the user to trace (optional)
-v, --verbose print resulting BPF program code before executing
-Z STRING_SIZE, --string-size STRING_SIZE
maximum size to read from strings
-S, --include-self do not filter trace's own pid from the trace
-M MAX_EVENTS, --max-events MAX_EVENTS
number of events to print before quitting
-t, --timestamp print timestamp column (offset from trace start)
-u, --unix-timestamp print UNIX timestamp instead of offset from trace
start, requires -t
-T, --time print time column
-C, --print_cpu print CPU id
-c CGROUP_PATH, --cgroup-path CGROUP_PATH
cgroup path
-n NAME, --name NAME only print process names containing this name
-f MSG_FILTER, --msg-filter MSG_FILTER
only print the msg of event containing this string
-B, --bin_cmp allow to use STRCMP with binary values
-s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
comma separated list of symbol files to use for symbol
resolution
-K, --kernel-stack output kernel stack trace
-U, --user-stack output user stack trace
-a, --address print virtual address in stacks
-I header, --include header
additional header files to include in the BPF program
as either full path, or relative to current working
directory, or relative to default kernel header search
path
-A, --aggregate aggregate amount of each trace
- Official examples:
trace do_sys_open
Trace the open syscall and print a default trace message when entered
trace kfree_skb+0x12
Trace the kfree_skb kernel function after the instruction on the 0x12 offset
trace 'do_sys_open "%s", arg2@user'
Trace the open syscall and print the filename being opened @user is
added to arg2 in kprobes to ensure that char * should be copied from
the userspace stack to the bpf stack. If not specified, previous
behaviour is expected.
trace 'do_sys_open "%s", arg2@user' -n main
Trace the open syscall and only print event that process names containing "main"
trace 'do_sys_open "%s", arg2@user' --uid 1001
Trace the open syscall and only print event that processes with user ID 1001
trace 'do_sys_open "%s", arg2@user' -f config
Trace the open syscall and print the filename being opened filtered by "config"
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
trace 'c:malloc "size = %d", arg1'
Trace malloc calls and print the size being allocated
trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
Trace the write() call from libc to monitor writes to STDOUT
trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
Trace returns from __kmalloc which returned a null pointer
trace 'r:c:malloc (retval) "allocated = %x", retval'
Trace returns from malloc and print non-NULL allocated buffers
trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'u:pthread:libpthread:pthread_create (arg4 != 0)'
Ditto, but the provider name "libpthread" is specified.
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone'
Trace nanosleep/clone syscall calls only under workload.service
cgroup hierarchy.
trace -I 'linux/fs.h' \
'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
Trace the uprobe_register inode mapping ops, and the symbol can be found
in /proc/kallsyms
trace -I 'kernel/sched/sched.h' \
'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
package. So this command needs to run at the kernel source tree root directory
so that the added header file can be found by the compiler.
trace -I 'net/sock.h' \
'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
Trace udpv6 sendmsg calls only if socket's destination port is equal
to 53 (DNS; 13568 in big endian order)
trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
Trace the number of users accessing the file system of the current task
trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
Trace inet_pton system call and use the specified libraries/executables for
symbol resolution.
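The udpv6_sendmsg example above compares sk_dport with 13568 rather than 53 because the port is stored in network (big-endian) byte order. The following Python sketch reproduces that number by reading the two big-endian bytes as a little-endian integer, as an x86 CPU would; `byte_swapped_port` is a hypothetical helper.

```python
def byte_swapped_port(port):
    """Return the 16-bit value a little-endian CPU sees for a big-endian port."""
    return int.from_bytes(port.to_bytes(2, "big"), "little")


# DNS port 53 is 0x0035; with the bytes swapped it becomes 0x3500.
print(byte_swapped_port(53))
print(hex(byte_swapped_port(53)))
```

The same helper gives the constant for any other port you want to filter on in a trace expression on a little-endian machine.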
5 References
- Brendan Gregg's blog
- Andrii Nakryiko's blog
- https://ebpf.io/zh-cn/
- https://www.brendangregg.com/bpf-performance-tools-book.html