eBPF Updates #3: Atomics Operations, Socket Options Retrieval, Syscall Tracing Benchmarks, eBPF in the Supply Chain

With the festive season, it would seem that eBPF blogging has cooled down a little, and we have fewer items to report this time. But eBPF is gaining traction everywhere, so we can be confident that more material will be available in the months to come. Let's wager that 2021 will be full of new features, tutorials, deep dives, commercial news, and good surprises in general. In the meantime, here is all the latest news. Welcome to the third issue of the eBPF Updates, and Happy New Year!

Let's start with some news from companies working on and with eBPF.

  • Microsoft is working on an eBPF-based monitoring tool for Linux.
  • Isovalent Looks to Transform Container Networking With eBPF, from Mike Vizard.
    This brief, high-level post focuses on the transformations that eBPF brings to container networking and, incidentally, how it led to the creation of Cilium and Isovalent. The author foresees important changes for cloud-native environments in terms of networking, security, and observability.
  • Kubernetes Podcast - Episode #133: Cilium, with Thomas Graf, from Craig Box and Adam Glick.
    Interviewed by Craig and Adam, Thomas Graf recounts the advent of eBPF, and how it introduced a new paradigm for network processing at a time when other firewalling solutions derived their behavior from hardware and failed to scale. The episode continues with explanations of how Cilium and Hubble were built on top of eBPF to provide network policies and monitoring for clusters. More advanced questions follow, on the relationship with Envoy, or on possible support for eBPF on Windows. “Twenty-two years on, do you think you finally fixed networking?” Time will tell.
  • Optimyze.cloud announced in a tweet that they are working on a “Full-system lightweight continuous profiling for Linux Kernel, C/C++, Rust, Golang, Python, JVM, PHP (with Ruby and Node planned for the future)”, apparently based on eBPF.
  • Securing Containerized Environments with eBPF, from Adam LeWinter.
    Following the transition from physical hardware to virtual machines, most workloads are now moving to containers. In this context, new challenges arise in terms of visibility and security, Adam explains. Cilium leverages eBPF to provide network routing and observability. It combines metadata from layers 3 and 4 with layer 7 parameters such as the HTTP method, bringing “visibility and enforcement based on a service, pod, or container identity”.

  • BPF: The future of configs, from Thomas Habets.
    While many presentations focus on introducing eBPF's technical aspects, this post takes a step back and describes why eBPF is “such a big deal”. For network packet processing in particular, older frameworks (ipfwadm, ipchains, iptables, nftables) are all about configuration, about feeding data into tables. By contrast, eBPF is about code and programming. This opens up numerous possible use cases: packet routing, filesystem access control, customization of TCP parameters, and more. This is a good read to help understand what is at stake with eBPF.
  • Linux Networking - eBPF, XDP, DPDK, VPP - What does all that mean?, from Andree Toonk.
    Definitely oriented towards networking, this video introduces several of the frameworks that have been used for fast networking over the past few years. The presentation is organized as the “journey” the presenter went through: from the quest for a fast traffic generator, which brought in DPDK, to VPP, and ultimately to eBPF and XDP, where you get fast packet processing capabilities while keeping at hand all the features and goodness of the Linux kernel. Andree also covers this topic on his blog.
  • File Integrity Monitoring using eBPF, from Sylvain Baubeau.
    After a brief introduction to eBPF, this post explains how it can improve features such as File Integrity Monitoring as implemented in the Datadog Agent. Several challenges came up, such as portability across kernel versions, monitoring of all hard links pointing to a given file, or performance overhead. eBPF addresses most of them, and brings performance and overall improvements to the feature.
  • Introduction to eBPF, from Matt Oswalt.
    This high-level introduction (technical details are left out for follow-up posts) explains what eBPF is, how it augments the Linux operating system, and why people care about it. The technology permits fast updates to the kernel's behavior, without the need to wait for patches to land upstream, or even to reboot the system. It brings flexibility, because you can compile your program on the fly and include just the features you need. Every technologist, the post claims, should be aware of eBPF and the changes that come with it, because it will soon be part of the “supply chain”. Just like the first link of this section, read this post if you are getting started with eBPF and want to understand the stakes.

  • Cilium documentation on The Kubernetes Networking Guide, from Michael Kashin.
    The Kubernetes Networking Guide aims at providing “an overview of various Kubernetes networking components, with a focus on exactly how they implement the required functionality”. The newly added Cilium overview is interesting in that it explains how Cilium deploys and uses eBPF programs and maps, and how to manipulate and inspect them. The document also details how to track a packet at the different stages of its processing in the datapath. This is a recommended read if you want a glimpse of advanced networking with eBPF.
  • How to Trace Linux System Calls in Production with Minimal Impact on Performance, from Wenbo Zhang.
    The answer to the question in the title, as you can imagine, is eBPF. The post explains that strace is good for inspecting system calls, but not usable in practice in production environments due to its overhead. Instead, the perf tool, which relies on eBPF for some features, is much better suited. In environments with containers using cgroup v2, the eBPF-based tool traceloop comes in handy. A benchmark of the different profilers mentioned in the post is provided in the last section.

  • ipftrace2, a tool to track packets inside the Linux kernel, got a new v0.1.0 release, bringing new features such as support for writing extension modules in C thanks to CO-RE, thus improving programmability and portability.

  • tc-skeleton is a simple example project to demonstrate how to load eBPF programs with go-tc, a work-in-progress version of tc (the Linux tool for traffic control) written in Go.

  • lnetd-host-encap is an experiment where an eBPF program encapsulates packets with MPLS headers.

Below are some highlights from the first pull request from the bpf-next tree for the 5.12 cycle:

  • Add atomic operations to eBPF. To that end, extend the eBPF instruction set with a new BPF_ATOMIC mode modifier for the operation codes. The atomics come with support in the x86-64 eBPF JIT (support in other JITs is left to developers more familiar with those architectures). Here is a summary of the new instructions:

    • atomic[64]_[fetch_]add
    • atomic[64]_[fetch_]and
    • atomic[64]_[fetch_]or
    • atomic[64]_xchg
    • atomic[64]_cmpxchg

    The motivation was to generate globally-unique cookies in eBPF programs, but these atomic operations are likely to prove useful to a number of other applications; a short sketch after this list shows how they can surface in a program. (Brendan Jackman, link)

  • Support for kernel module global variables (__ksym externs) in eBPF programs. This is a follow-up improvement on the recent support for BTF for kernel modules, to have BTF-powered raw tracepoints or tracing hooks available for modules; see the sketch after this list. (Andrii Nakryiko, link)

  • Generalize eBPF stackmap's build-id retrieval and add support for storing build-ids in the mmap2 event for perf (this event generates an extended executable mmap record that contains enough additional information to uniquely identify shared mappings; see the perf_event_open(2) man page). (Jiri Olsa, link)

  • Support retrieval of more SOL_SOCKET level options from sock_addr eBPF programs, to fill the gap between the list of options that bpf_setsockopt() can set and those that bpf_getsockopt() could retrieve (a sketch follows this list). The options concerned are:

    • SO_MARK
    • SO_PRIORITY
    • SO_BINDTOIFINDEX (also new for bpf_setsockopt())

    (Daniel Borkmann, link)

  • Improve out-of-tree cross-building for eBPF selftests. Although this adds no new feature, it is worth reporting because it should enable wider automated testing on ARM architectures. Selftests are, of course, an essential part of the eBPF ecosystem. (Jean-Philippe Brucker, link)
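
To give a rough idea of how the new atomic operations surface in eBPF programs, here is a minimal C sketch built around the unique-cookie motivation mentioned above. It assumes a recent clang (12 or later) with -mcpu=v3, which is what the compiler needs to emit the fetch variants for the BPF target; the counter and section names are illustrative only.

// Sketch only: a shared counter handing out unique cookies with the
// new atomic fetch-and-add. Compile with something like:
//   clang -O2 -g -target bpf -mcpu=v3 -c cookies.c -o cookies.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

__u64 next_cookie = 0; /* global variable, backed by the object's .bss map */

SEC("tracepoint/syscalls/sys_enter_execve")
int assign_cookie(void *ctx)
{
    /* Emitted as atomic64_fetch_add: returns the previous value, so
     * concurrent invocations each get a distinct cookie. */
    __u64 cookie = __sync_fetch_and_add(&next_cookie, 1);

    bpf_printk("assigned cookie %llu", cookie);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";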
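
As for the kernel module ksym support, the user-visible idea is that a program can declare an extern variable with the __ksym attribute and let libbpf resolve it against a module's BTF at load time. The sketch below is hypothetical: my_module_counter stands in for a global variable that some BTF-enabled module would expose.

// Sketch only: reading a (hypothetical) global variable exposed by a
// kernel module, resolved through the module's BTF.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

extern const int my_module_counter __ksym; /* hypothetical module variable */

SEC("raw_tp/sys_enter")
int dump_module_var(void *ctx)
{
    int val = 0;

    /* Read the variable through its resolved kernel address. */
    bpf_probe_read_kernel(&val, sizeof(val), &my_module_counter);
    bpf_printk("module counter: %d", val);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";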
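
Finally, here is a rough sketch of what retrieving one of the newly readable SOL_SOCKET options could look like from a sock_addr hook (cgroup/connect4 in this example). The SOL_SOCKET and SO_MARK constants are hard-coded with the values used on common architectures, just to keep the sketch self-contained.

// Sketch only: reading the socket mark from a cgroup/connect4 program.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_SOCKET 1   /* values for common architectures, see <asm/socket.h> */
#define SO_MARK    36

SEC("cgroup/connect4")
int log_sock_mark(struct bpf_sock_addr *ctx)
{
    __u32 mark = 0;

    /* SO_MARK, SO_PRIORITY and SO_BINDTOIFINDEX can now be retrieved
     * at the SOL_SOCKET level. */
    if (!bpf_getsockopt(ctx, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)))
        bpf_printk("connect4: socket mark %u", mark);

    return 1; /* allow the connection to proceed */
}

char LICENSE[] SEC("license") = "GPL";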

eBPF objects, such as a program or a map, reside in kernel memory until they are no longer needed. Internally, the kernel uses reference counters to keep track of the number of “handles” pointing to such objects. When the number of references comes down to zero, the program or the map is destroyed. The references to a program would typically be a hook where the user has attached the program (such as a tc filter or a kernel probe), or file descriptors that were returned from the kernel when loading the program with the bpf() system call. Similarly, references to an eBPF map can be held by eBPF programs using the map or by a user program that retrieved a file descriptor.

As a consequence, if a process loads an eBPF program without attaching it, the program is destroyed when the process exits and its file descriptors are closed. There are ways to share file descriptors between processes, but to make it easier to reference eBPF objects between user applications, or simply to keep them alive at a time when they would otherwise have no reference left in the kernel (such as a detached program or an unused map), another mechanism has been created: the eBPF virtual filesystem.

The eBPF virtual (or pseudo) filesystem, often called bpffs, is traditionally mounted at /sys/fs/bpf, but any alternative mount point works. It is possible to pin objects to this virtual filesystem, where they appear as file paths. Calling the bpf() system call with its BPF_OBJ_PIN subcommand pins an object. Then, using the BPF_OBJ_GET subcommand on a bpffs path retrieves a file descriptor to this pinned object. Removing a pinned path simply involves a call to unlink(), just like for regular paths. Pinned paths (and the eBPF objects they reference) do not persist across reboots.
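
From C code, these two subcommands are usually reached through libbpf's thin wrappers, bpf_obj_pin() and bpf_obj_get(). Here is a minimal user space sketch, assuming libbpf 0.7 or newer (for bpf_map_create()), a bpffs mounted at /sys/fs/bpf, and a purely hypothetical pin path.

// Sketch only: create a map, pin it (BPF_OBJ_PIN), then re-open it
// from the pinned path (BPF_OBJ_GET). Build with: gcc pin.c -lbpf
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>

int main(void)
{
    const char *path = "/sys/fs/bpf/my_map"; /* hypothetical pin path */
    int fd, fd2;

    /* Array map with 4-byte keys, 32-byte values, 8 entries. */
    fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "pin_demo", 4, 32, 8, NULL);
    if (fd < 0)
        return 1;

    /* Pin it: the map now outlives this process. */
    if (bpf_obj_pin(fd, path) < 0) {
        perror("bpf_obj_pin");
        return 1;
    }
    close(fd);

    /* Later (possibly from another process): get a new fd from the path. */
    fd2 = bpf_obj_get(path);
    if (fd2 < 0) {
        perror("bpf_obj_get");
        return 1;
    }
    printf("re-opened pinned map, fd %d\n", fd2);
    close(fd2);

    /* unlink(path) would drop the pinned reference. */
    return 0;
}

Once the first process exits, any other program with sufficient privileges can recover the map from the same pinned path.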

Note that the use of periods (.) in pinned paths is restricted. The character had long been unused, but a recent feature introduced it to mark the paths of specific eBPF iterators that the system can preload, maps.debug and progs.debug (but let's keep those for another time). Any other character allowed in UNIX file names can be used. Yes, /sys/fs/bpf/🐝 is a valid path.

Here is a concrete example. We create an eBPF map with bpftool. Because no program uses the map yet, the only reference created is a file descriptor, which is closed when bpftool exits. To avoid losing the map at this stage, bpftool takes a path name and uses it to pin the map.

# bpftool map create /sys/fs/bpf/🍯 type array key 4 value 32 entries 8 name honeypot
# bpftool --bpffs map show pinned /sys/fs/bpf/🍯
42: array  name honeypot  flags 0x0
        key 4B  value 32B  max_entries 8  memlock 4096B
        pinned /sys/fs/bpf/🍯

We can then reuse this map when loading a program:

# bpftool prog load bee.o /sys/fs/bpf/🐝 map name honeypot pinned /sys/fs/bpf/🍯

Of course, you do not have to use emojis. More information on the virtual eBPF filesystem is available (although somewhat scattered) in the BPF and XDP Reference Guide. A post called Lifetime of BPF objects, from Alexei Starovoitov, is an excellent resource to learn more about how eBPF objects are managed in the kernel. More information on bpftool usage is available from the dedicated man pages.

Note that there are a few other eBPF objects (BTF, links, iterators) and that some of them are not handled exactly in the same manner. There are also other ways to reference programs and maps, such as references in program array maps or maps of maps.

The enthusiasm about eBPF remains as strong as ever.

Meanwhile, a discussion thread asked how eBPF usability could be improved.

In its annual predictions for the year to come, LWN.net foresees that eBPF should be used in an increasing number of products and services in 2021, although at the cost of implementing functionalities separately from the kernel.

And we should apparently expect several eBPF-related talks at the next KubeCon, for which the agenda has not been published yet.

The eBPF Updates are brought to you by the Cilium project. This report was produced by Quentin Monnet (Isovalent). Thanks to the Cilium engineering team for input and reviews.

If you would like to submit contributions for the next report, please submit them via the #ebpf-news channel on eBPF Slack.
