Thread support in Mustang
Mustang, a system for running Rust programs written entirely in Rust, has made a lot of progress since the last blog post:
- New targets: riscv64 and arm, joining x86_64, aarch64, and x86; thanks to @Urgau for arm support in rustix!
- Threading support, including TLS, TLS destructors, and detaching
- Panic and unwind support, thanks to the unwinding crate!
- Math library support, thanks to the libm crate!
- A proper allocator, thanks to the dlmalloc crate!
- No more debugging messages on stderr by default
- DNS (`ToSocketAddrs`) support
- Much smaller code size
- Lots more test coverage
Mustang's thread library is a chance to explore the role of a thread library, the special syscalls and registers that only a thread library uses, and the interaction between a thread library and other system calls. The rest of this blog post takes a closer look.
I'd like to thank Steed's pthread implementation for the initial inspiration here, and for demonstrating how to use `clone`, `futex`, and the platform thread register.
So, what does a thread library do?
Create threads
First, what is a thread?
A thread consists of an OS thread, which on Linux is just a special kind of process, and some data: metadata, user TLS data, stack memory, and a stack guard. Creating a thread involves allocating memory for the data, and then creating an OS thread configured to use them.
Mustang's origin crate allocates all the memory for a thread in a single contiguous anonymous `mmap` allocation. Most platforms have a special "Thread Pointer" register which is used to point to thread-specific data (`%fs` on x86_64, `%gs` on x86, `tpidr_el0` on aarch64, `tp` on RISC-V, and so on), and we use this register to point to this allocation. Compiled code uses this register to locate TLS data, and origin uses this register to locate its metadata.
The specific layout differs between architectures, for example with the metadata being located before or after the user TLS data. And things would be much more involved if we were talking about dynamic linking. But this covers the basics.
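As a rough sketch of the kind of thing the thread pointer points at (hypothetical field names; origin's real layout differs and is architecture-specific):

```rust
use core::ffi::c_void;
use core::sync::atomic::{AtomicBool, AtomicU32};

/// Per-thread metadata reached through the platform thread-pointer register.
#[repr(C)]
struct ThreadMetadata {
    /// The thread id; cleared by the kernel on exit via CLONE_CHILD_CLEARTID.
    tid: AtomicU32,
    /// Whether the thread has been detached (no one will ever join it).
    detached: AtomicBool,
    /// Base and size of the whole mmap'd allocation (metadata + TLS +
    /// stack + guard), so it can be munmap'd when the thread goes away.
    map_addr: *mut c_void,
    map_size: usize,
    /// Size of the guard region at the end of the stack.
    guard_size: usize,
    // ... TLS destructor list, etc.
}
```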
Once the data is allocated and initialized, it's time to create the OS thread.
The `clone` system call
Origin creates an OS thread using the `clone` system call. Linux considers this to be similar to `fork`, which creates new processes, but with more options. A new thread is a new execution context, with its own CPU register state, similar to a new process. The main thing that makes it a thread instead of a process is that it happens to use the same virtual address space and other resources as its parent.
Like `fork`, `clone` is invoked by one execution context, and when it returns there are two execution contexts, one for the parent and one for the child. Unlike `fork` though, when the child shares virtual memory with the parent, it can't continue on the same stack, so we pass `clone` the pointer to the stack we allocated above for the child to run on.
The child starts execution by returning from the `clone` call, and the return value of the `clone` call is 0 in the child and the child's thread id in the parent, so the code after the `clone` call can test whether it's running in the child or the parent.
The `clone` system call doesn't have a built-in way of passing any arguments to the child to tell it what to do. To do that, we place additional arguments in registers that we know the system call doesn't clobber, effectively "passing" them to the child.
To illustrate, here's an annotated version of the RISC-V code for this. Other architectures do equivalent things, though e.g. x86 has more ABI details in play.
// Place syscall arguments in a0, a1, ...,
// and `__NR_clone` in a7.
//
// And place the argument to pass to the child, and the
// function pointer for it to call, in a5 and a6, which are
// not used by the syscall, but not clobbered either.
...
ecall // Do the `clone` system call.
bnez a0, 0f // If we're in the parent, branch to 0:
// We're in the child! Move the argument and function
// pointer into calling-convention registers for the call to
// `entry`.
mv a0, a5 // `arg`
mv a1, a6 // `fn_`
// Zero out the frame address and return address. The call
// below will never return.
mv fp, zero
mv ra, zero
// Call into our Rust code which will call the callee,
// passing it the argument. The Rust code will terminate
// the thread and not return here.
tail entry
// We're in the parent! Return from the system call in the
// normal way.
0:
...
origin's philosophy is to do as little as possible in assembly, so this code just sets the frame pointer and return address to null, to ensure that we never try to return back to our little assembly fragment, moves the arguments into the platform calling-convention argument registers, and calls into Rust code, specifically the `threads::entry` function, to do the rest.
As an aside, this is very similar to the code origin uses for the start of the process, `_start`, which calls the `program::entry` function. The OS doesn't start a process in a manner compatible with the platform calling convention, so we use a minimal amount of assembly to set up a call into Rust code, also including setting the return address and frame pointer to null.
In both threads and processes, we always exit by calling the `exit` (for threads) or `exit_group` (for the process) system calls, and never by returning to assembly code.
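As a rough sketch of the shape of such a thread-entry function (names and details are illustrative, not origin's exact code, and `libc::syscall` stands in here for Mustang's own syscall path):

```rust
use core::ffi::c_void;

// Called from the assembly above, on the new thread's stack.
unsafe extern "C" fn entry(arg: *mut c_void, fn_: extern "C" fn(*mut c_void)) -> ! {
    // Call the user's thread function with its argument.
    fn_(arg);

    // Run TLS destructors and release resources if detached (elided here),
    // then exit just this thread with the `exit` system call. We never
    // return to the assembly that called us.
    libc::syscall(libc::SYS_exit, 0);
    unreachable!()
}
```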
Join threads
When one thread wants to wait for another to exit, it performs a "join". Once the other thread exits, the joining thread is woken up; it frees any resources associated with the other thread and collects the return value (though Rust's `std::thread` implementation doesn't use thread return values as such, so the return value part isn't implemented in Mustang yet).
In the `clone` system call that we called in `create_thread` above, one of the flags we pass is `CLONE_CHILD_CLEARTID`. This tells Linux to clear the child thread id and wake up any futexes that are waiting on that memory location when the child exits. It's important that it does both, so that the parent's futex call can avoid waiting if the child has exited first.
If the child hasn't exited yet, `join_thread`'s `futex` call waits until the child exits and Linux wakes it up. If the thread wasn't detached (more on that below), joining then frees the thread's memory: its stack, TLS data, and metadata.
In the code, this is `join_thread`, which calls `wait_for_thread_exit`, and then `free_thread_memory`.
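To sketch the waiting step (hypothetical names; shown with `libc::syscall` purely for illustration, where origin uses its own syscall path): the parent waits on the child's tid word, which Linux clears and wakes thanks to `CLONE_CHILD_CLEARTID`.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Wait until the kernel clears the child's tid word (and wakes us) when
// the child exits.
unsafe fn wait_for_thread_exit(child_tid: &AtomicU32) {
    loop {
        let tid = child_tid.load(Ordering::Acquire);
        if tid == 0 {
            return; // The child has already exited.
        }
        // Sleep until the value changes from `tid`. If the child exited in
        // the meantime, the value no longer matches and FUTEX_WAIT returns
        // immediately with EAGAIN, so we can't miss the wakeup.
        libc::syscall(
            libc::SYS_futex,
            child_tid.as_ptr(),
            libc::FUTEX_WAIT,
            tid,
            core::ptr::null::<libc::timespec>(),
        );
    }
}
```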
Run thread destructors (registered with `__cxa_thread_atexit_impl`)
Variables declared with the `thread_local` macro can have `Drop` implementations, and those `drop` functions are called on a thread's copy of the data when the thread exits.
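For example, the destructor here runs on each thread's copy of the value when that thread exits:

```rust
struct Logger(&'static str);

impl Drop for Logger {
    fn drop(&mut self) {
        // Runs on this thread's copy when the thread exits.
        println!("dropping {}", self.0);
    }
}

thread_local! {
    static LOGGER: Logger = Logger("per-thread logger");
}

fn main() {
    std::thread::spawn(|| {
        LOGGER.with(|logger| println!("using {}", logger.0));
        // When this thread exits, `Logger::drop` runs for its copy.
    })
    .join()
    .unwrap();
}
```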
`__cxa_thread_atexit_impl` is the C ABI function to register a cleanup function to call when a thread exits. It simply pushes the function onto a `Vec`, and when the thread exits, `call_thread_dtors` calls the functions in the `Vec` in reverse order of their registration, to ensure that object initializations and finalizations are properly nested.
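A minimal, self-contained sketch of that bookkeeping (the real origin code stores the list in the thread's metadata and handles locking and the `dso` argument; this just shows the push/pop-in-reverse shape):

```rust
use core::ffi::c_void;

type DtorFn = unsafe extern "C" fn(*mut c_void);

/// The (function, argument) pairs registered for one thread, in
/// registration order.
#[derive(Default)]
struct ThreadDtors(Vec<(DtorFn, *mut c_void)>);

impl ThreadDtors {
    /// What `__cxa_thread_atexit_impl` does: remember the call for later.
    fn register(&mut self, func: DtorFn, obj: *mut c_void) {
        self.0.push((func, obj));
    }

    /// What `call_thread_dtors` does at thread exit: run the destructors in
    /// reverse order of registration, so finalization nests properly inside
    /// initialization.
    unsafe fn run_all(&mut self) {
        while let Some((func, obj)) = self.0.pop() {
            func(obj);
        }
    }
}
```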
Detach threads
"Detaching" a thread declares that it will never be "joined" by any other thread, so it needs to release its resources on its own when it exits.
This is a fairly straightforward matter of just doing bookkeeping to keep track of whether a thread is in the detached state, and freeing resources when it exits if it is, but there are a few catches.
One is that we create threads with `clone` with the `CLONE_CHILD_CLEARTID` flag, which tells Linux to zero out our tid field when the child exits. That helps `join_thread`, but for detached threads it means that if we free our memory before we exit, the kernel will still think it needs to clear the tid field. Something else in the program could reuse our freed memory, and it could get corrupted if Linux clears out what is no longer our tid field.
So when a thread is marked detached, we call the `set_tid_address` system call, passing it a null pointer, which effectively disables `CLONE_CHILD_CLEARTID` for the thread.
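In sketch form (again using `libc::syscall` purely for illustration, and a hypothetical function name), this amounts to:

```rust
// Point the kernel's "clear child tid" address at null for the current
// thread, so it won't write into memory we're about to free.
unsafe fn disable_child_tid_clearing() {
    libc::syscall(libc::SYS_set_tid_address, core::ptr::null_mut::<u32>());
}
```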
The other is that a thread's memory is freed by an `munmap` system call, but if we call the usual `munmap` function that wraps the system call, it will free the current thread's memory, including the stack we're running on, out from underneath us. The `munmap` wrapper will then access freed memory when it tries to read its return address for the return jump.
To fix this, we need more assembly code. We need a special code sequence to make a `munmap` system call, and then make an `exit` system call to exit the thread, without touching the stack after the `munmap`. Fortunately, this can be done in just a handful of instructions.
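Here's a hedged sketch of what such a sequence can look like, written as Rust inline assembly for x86_64 (syscall numbers 11 for `munmap` and 60 for `exit`); origin's actual sequences are per-architecture assembly, but the shape is the same:

```rust
use core::ffi::c_void;

/// Free the thread's own allocation and exit the thread, without touching
/// the (now freed) stack in between.
#[cfg(target_arch = "x86_64")]
unsafe fn munmap_and_exit_thread(addr: *mut c_void, len: usize) -> ! {
    core::arch::asm!(
        "syscall",          // munmap(addr, len): frees our own stack
        "xor edi, edi",     // exit status = 0
        "mov eax, 60",      // __NR_exit: exits just this thread
        "syscall",          // exit(0): never returns
        "ud2",              // unreachable
        in("rax") 11usize,  // __NR_munmap
        in("rdi") addr,
        in("rsi") len,
        options(noreturn, nostack)
    )
}
```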
Note that on 32-bit x86, we normally don't make syscalls directly, because the usual `int 0x80` mechanism is very slow, and Linux provides a much faster way to make syscalls through the vDSO. However, we can't use that here, because the vDSO is effectively another syscall wrapper that expects to store a return address on the stack. So we use `int 0x80` so that we avoid touching the stack entirely. Fortunately, threads don't exit all that often, so the performance loss here isn't that important.
Synchronize threads
It's common for threading libraries to provide synchronization primitives such as mutexes and reader-writer locks.
Mustang's main strategy is to use `parking_lot` to implement these, since `parking_lot` is a widely-used library, and there's not a lot to add here.
Except there's one place Mustang can't yet use `parking_lot`: the Rust global allocator uses a `Mutex`, because it's allocating memory for multiple threads from a shared heap. But the twist for Mustang is, `parking_lot`'s `Mutex` does global allocation.
Uh oh.
Allocator tries to acquire a lock, which tries to perform an allocation, which tries to acquire a lock, which tries to perform an allocation...
💥
So what do we do? For now, Mustang has its own `Mutex` implementation, built on top of atomics and Linux's `futex`. It's nothing fancy, it's unfair, it's inefficient in a bunch of ways, and it's certainly not proven correct. But it's pretty simple and it passes all the tests! This is your regular reminder that simply implementing everything in Rust does not make anything automatically safer. Mustang is still experimental at this point, with lots of `unsafe`.
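For a feel of the general technique (not Mustang's actual lock code), here's a minimal futex-based mutex in that spirit: unfair, unoptimized, and using `libc::syscall` for the futex calls purely for illustration.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;
const CONTENDED: u32 = 2;

pub struct Mutex {
    state: AtomicU32,
}

impl Mutex {
    pub const fn new() -> Self {
        Self { state: AtomicU32::new(UNLOCKED) }
    }

    pub fn lock(&self) {
        if self
            .state
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Slow path: mark the lock contended and sleep on the futex
            // until the holder wakes us.
            while self.state.swap(CONTENDED, Ordering::Acquire) != UNLOCKED {
                unsafe {
                    libc::syscall(
                        libc::SYS_futex,
                        self.state.as_ptr(),
                        libc::FUTEX_WAIT,
                        CONTENDED,
                        core::ptr::null::<libc::timespec>(),
                    );
                }
            }
        }
    }

    // Must only be called by the thread that holds the lock.
    pub fn unlock(&self) {
        // If anyone was (or might have been) waiting, wake one of them.
        if self.state.swap(UNLOCKED, Ordering::Release) == CONTENDED {
            unsafe {
                libc::syscall(libc::SYS_futex, self.state.as_ptr(), libc::FUTEX_WAKE, 1);
            }
        }
    }
}
```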
But not cancellation
The other really big piece of functionality that a libpthread would provide is cancellation. Cancellation is very complex, and requires hooks in libc and very special handling of many system calls. However, Rust doesn't support cancellation, and it's relatively rare even in C code, so origin and mustang don't implement it.
Mustang organization
The origin crate provides low-level but still somewhat Rust-idiomatic interfaces to process startup and shutdown, and now also threads. This accompanies the rustix crate which provides Rust-idiomatic interfaces to system calls.
The c-scape crate provides libc and libpthread ABIs as wrappers around rustix and origin. Right now, this allows existing code, such as Rust's `std`, to run on origin and rustix without any extra porting work.
But also, code that wants to can bypass the c-scape compatibility layer, and call into rustix and origin directly. This eliminates some overhead, but more importantly, it offers greater safety and simplicity, because there are fewer raw pointers, raw file descriptors, and raw error return values.
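As a small illustration of that difference (assuming rustix's `fs::open` and `io::read` with the corresponding cargo features enabled): file descriptors are owned types and errors are `Result`s, rather than raw integers and `-1`/`errno` conventions.

```rust
use rustix::fs::{open, Mode, OFlags};
use rustix::io::read;

fn main() -> Result<(), rustix::io::Errno> {
    // `fd` is an `OwnedFd`: it can't be forged by accident, and it's
    // closed automatically when dropped.
    let fd = open("/etc/hostname", OFlags::RDONLY, Mode::empty())?;

    let mut buf = [0u8; 64];
    let n = read(&fd, &mut buf)?;
    println!("read {} bytes", n);
    Ok(())
}
```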
And beyond!
Of course, even rustix and origin are still very low-level. Most users should continue to use `std`, and high-level libraries such as cap-std. But perhaps someday there could be a way of using rustix in std, simplifying the code by factoring out raw pointers, raw file descriptors, and raw error handling.
And perhaps someday, one could even imagine, an official Rust target for Rust programs built entirely in Rust, with I/O safety down to the syscalls.