Thread support in Mustang
Mustang, a system for running Rust programs written entirely in Rust, has made a lot of progress since the last blog post:
- New targets: riscv64 and arm, joining x86_64, aarch64, and x86; thanks to @Urgau for arm support in rsix!
- Threading support, including TLS, TLS destructors, and detaching
- Panic and unwind support, thanks to the unwinding crate!
- Math library support, thanks to the libm crate!
- A proper allocator, thanks to the dlmalloc crate!
- No more debugging messages on stderr by default
- DNS support
- Much smaller code size
- Lots more test coverage
Mustang's thread library is a chance to explore the role of a thread library, the special syscalls and registers that only a thread library uses, and the interaction between a thread library and other system calls. The rest of this blog post takes a closer look.
I'd like to thank Steed's pthread implementation for the initial
inspiration here, and for demonstrating how to use
futex and the platform thread register.
So, what does a thread library do?
First, what is a thread?
A thread consists of an OS thread, which on Linux is just a special kind of process, and some data: metadata, user TLS data, stack memory, and a stack guard. Creating a thread involves allocating memory for the data, and then creating an OS thread configured to use them.
Mustang's origin crate allocates all the memory for a thread in a single
mmap allocation. Most platforms have a special
"Thread Pointer" register which is used to point to thread-specific data,
%fs on x86_64,
%gs on x86,
tpidr_el0 on aarch64,
tp on RISC-V, and
so on, and we use this to point to this allocation. Compiled code uses this
register to locate TLS data, and origin uses this register to locate its own thread metadata.
The specific layout differs between architectures, for example with the metadata being located before or after the user TLS data. And things would be much more involved if we were talking about dynamic linking. But this covers the basics.
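To make the layout concrete, here's a rough sketch of how a single-allocation layout could be computed. The sizes, alignments, offsets, and function names here are hypothetical, for illustration only; origin's actual layout differs, including between architectures:

```rust
// Hypothetical layout of one thread's single mmap allocation:
// a guard page at the bottom, then the stack, then user TLS data,
// then the thread's metadata. All numbers are illustrative.
const PAGE_SIZE: usize = 4096;

fn round_up(n: usize, align: usize) -> usize {
    (n + align - 1) & !(align - 1)
}

/// Returns (total_size, stack_offset, tls_offset, metadata_offset).
fn layout(stack_size: usize, tls_size: usize, metadata_size: usize)
    -> (usize, usize, usize, usize)
{
    let guard = PAGE_SIZE; // guard page below the stack
    let stack_offset = guard;
    let tls_offset = round_up(stack_offset + stack_size, 16);
    let metadata_offset = round_up(tls_offset + tls_size, 16);
    let total = round_up(metadata_offset + metadata_size, PAGE_SIZE);
    (total, stack_offset, tls_offset, metadata_offset)
}

fn main() {
    let (total, stack, tls, metadata) = layout(1 << 20, 256, 128);
    // The whole region is page-granular, as mmap requires.
    assert_eq!(total % PAGE_SIZE, 0);
    println!("total={total} stack={stack} tls={tls} metadata={metadata}");
}
```

A single mmap like this means one allocation and one munmap per thread, and the guard page makes stack overflows fault rather than silently corrupt the TLS data above the stack.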
Once the data is allocated and initialized, it's time to create the OS thread.
The clone system call
Origin creates an OS thread using the
clone system call. Linux considers
this to be similar to
fork, which creates new processes, but with more
options. A new thread is a new execution context, with its own CPU register
state, similar to a new process. The main thing that makes it a thread instead
of a process is that it happens to use the same virtual address space and other
process resources as its parent.
clone is invoked by one execution context, and when it returns
there are two execution contexts, one for the parent and one for the child.
Unlike fork though, when the child shares virtual memory with the parent, it
can't continue on the same stack, so we pass
clone the pointer to the stack
we allocated above for the child to run on.
The child starts execution by returning from the
clone call, and the return
value of the
clone call is 0 in the child, and the child's thread id in the
parent, so the code after the
clone call can test whether it's running in
the child or the parent.
The clone system call doesn't have a built-in way of passing any arguments to
the child to tell it what to do. To do that, we place additional arguments in
registers that we know the system call doesn't clobber, effectively "passing"
them to the child.
To illustrate, here's an annotated version of the RISC-V code for this. Other architectures do equivalent things, though e.g. x86 has more ABI details in play.
```
// Place syscall arguments in a0, a1, ...,
// and `__NR_clone` in a7.
//
// And place the function pointer for the child to call, and
// the argument to pass to it, in a8 and a9, which are not
// used by the syscall, but not clobbered either.
...

ecall           // Do the `clone` system call.
bnez a0, 0f     // If we're in the parent, branch to 0:

// We're in the child! Move the argument and function
// pointer into calling-convention registers for the call to
// `entry`.
mv a0, a8       // `arg`
mv a1, a9       // `fn_`

// Zero out the frame address and return address. The call
// below will never return.
mv fp, zero
mv ra, zero

// Call into our Rust code which will call the callee,
// passing it the argument. The Rust code will terminate
// the thread and not return here.
tail entry

// We're in the parent! Return from the system call in the
// normal way.
0:
...
```
origin's philosophy is to do as little as possible in assembly, so this code
just sets the frame pointer and return address to null, to ensure that we never
try to return back to our little assembly fragment, moves the arguments into
place for the platform calling-convention argument registers, and calls into
Rust code, specifically the
threads::entry function, to do the rest.
As an aside, this is very similar to the code origin uses for the start of
the process, _start, which calls the program::entry function. The OS
doesn't start a process in a manner compatible with the platform
calling convention, so we use a minimal amount of assembly to set up a call
into Rust code, also including setting the return address and frame pointer
to null.
In both threads and processes, we always exit by calling the exit
(for threads) or
exit_group (for the process) system calls, and never by
returning to assembly code.
When one thread wants to wait for another to exit, it performs a "join".
Once the other thread exits, the joining thread is woken up, and it frees
any resources associated with the other thread, and collects the return
values (though Rust's
std::thread implementation doesn't use thread return
values as such, so the return value part isn't implemented in Mustang yet).
In the clone system call that we called in create_thread above, one of the
flags we pass is
CLONE_CHILD_CLEARTID. This tells Linux to clear the child thread
id and wake up any futexes that are waiting on that memory location when the
child exits. It's important that it does both, so that the parent's futex call
can avoid waiting if the child has exited first.
If the child hasn't exited yet, the parent's futex call waits until the
child exits and Linux wakes it up. If the thread wasn't detached (more on that
below), joining then frees the thread's memory: its stack, TLS data, and metadata.
In the code, this is join_thread.
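To illustrate the mechanism, here's a small model of the join handshake. The kernel's CLONE_CHILD_CLEARTID behavior is played by the child thread itself zeroing the tid field, and spinning stands in for sleeping in futex. This is a sketch for illustration, with hypothetical names, not origin's actual code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Model of joining: wait until the child's tid field reads zero.
// A real joiner would sleep in a futex rather than spin, and the
// kernel, not the child, would clear the field.
fn wait_for_clear(tid: &AtomicU32) {
    while tid.load(Ordering::Acquire) != 0 {
        std::hint::spin_loop();
    }
}

fn main() {
    let child_tid = Arc::new(AtomicU32::new(42)); // pretend the child's tid is 42
    let tid_for_child = Arc::clone(&child_tid);

    let child = thread::spawn(move || {
        // ... the child runs ...
        // On exit, the kernel would clear the tid field and wake
        // any futex waiters; we model that with a plain store.
        tid_for_child.store(0, Ordering::Release);
    });

    wait_for_clear(&child_tid);
    // Only now is it safe to free the child's stack, TLS, and metadata.
    child.join().unwrap();
    println!("joined");
}
```

The key property the model shows is the ordering: the memory can only be freed after the tid field reads zero, which the kernel guarantees happens after the child has finished running on its stack.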
Run thread destructors (registered with __cxa_thread_atexit_impl)
Variables declared with the
thread_local macro can have Drop implementations, and those
drop functions are called on a thread's copy
of the data when the thread exits.
__cxa_thread_atexit_impl is the C ABI function to register a cleanup
function to call when a thread exits. It simply pushes the function onto
a Vec, and when the thread exits, origin calls the functions in the
Vec, in reverse order of their registration,
to ensure that object initializations and finalizations are nested.
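The mechanism can be sketched in a few lines of Rust. The names thread_atexit and run_dtors here are hypothetical stand-ins for the real registration and cleanup entry points:

```rust
use std::cell::RefCell;

// Sketch of a per-thread destructor list in the spirit of
// __cxa_thread_atexit_impl: registered functions run in reverse
// order of registration when the thread exits.
thread_local! {
    static DTORS: RefCell<Vec<Box<dyn FnOnce()>>> = RefCell::new(Vec::new());
}

fn thread_atexit(f: Box<dyn FnOnce()>) {
    DTORS.with(|d| d.borrow_mut().push(f));
}

fn run_dtors() {
    // Pop in reverse order so later-initialized objects are
    // finalized first, keeping initialization/finalization nested.
    while let Some(f) = DTORS.with(|d| d.borrow_mut().pop()) {
        f();
    }
}

fn main() {
    use std::sync::{Arc, Mutex};
    let order = Arc::new(Mutex::new(Vec::new()));
    for i in 0..3 {
        let order = Arc::clone(&order);
        thread_atexit(Box::new(move || order.lock().unwrap().push(i)));
    }
    run_dtors();
    // Destructors ran in reverse order of registration.
    assert_eq!(*order.lock().unwrap(), vec![2, 1, 0]);
}
```

Popping one entry at a time, rather than iterating over the Vec, also handles the case where a destructor registers further destructors while it runs.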
"Detaching" a thread declares that it will never be "joined" by any other thread, so it needs to release its resources on its own when it exits.
This is a fairly straightforward matter of bookkeeping: keep track of whether a thread is in the detached state, and free its resources when it exits if it is. But there are a few catches.
One is that we create threads with
clone with the CLONE_CHILD_CLEARTID
flag, which tells Linux to zero out our tid field when the child exits.
This is what makes join_thread work, but for detached threads, it means that if we free
our memory before we exit, the kernel will still think it needs to clear the
tid field. Something else in the program could reuse our free'd memory, and
it could get corrupted if Linux clears out what is no longer our tid field.
So when a thread is marked detached, we call the set_tid_address system
call, passing it a null pointer, which effectively disables the
CLONE_CHILD_CLEARTID behavior for the thread.
The other is that a thread's memory is freed with a munmap system call, but
if we call the usual
munmap function that wraps the system call, it will
free the current thread's memory, including the stack we're running on, out from
underneath us. The
munmap wrapper will then access freed memory when it tries to
read its return address for the return jump.
To fix this, we need more assembly code: a special code
sequence that makes a
munmap system call, and then makes an exit system
call to exit the thread, without touching the stack after the munmap.
Fortunately, this can be done in just a handful of instructions.
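As a sketch of the idea, on RISC-V the sequence might look something like the following. This is pseudocode for illustration, not origin's actual assembly; it assumes the region's base and length have already been placed in the argument registers:

```
// Pseudocode sketch: a0 and a1 are assumed to already hold the
// base and length of the thread's mmap region.
li a7, 215    // __NR_munmap on RISC-V
ecall         // Free the memory, including this stack.
              // From here on, we must not touch the stack.
li a0, 0      // Exit status.
li a7, 93     // __NR_exit on RISC-V
ecall         // Exit the thread; never returns.
```

Everything after the first ecall uses only registers, which is what makes it safe to run with the stack already unmapped.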
Note that on 32-bit x86, we normally don't make syscalls directly, because
the int 0x80 mechanism is very slow, and Linux provides a much
faster way to make syscalls through the vDSO. However, we can't use that
here because the vDSO is effectively another syscall wrapper that
expects to store a return address on the stack. So we use int 0x80 here,
so that we can avoid touching the stack entirely. Fortunately,
threads don't exit all that often, so the performance loss here isn't
significant.
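Stepping back, the detach bookkeeping itself can be sketched as a small atomic state machine. This is a hypothetical illustration, not origin's actual data structure: whichever party acts second, the exiting thread or the detacher/joiner, is the one responsible for freeing the memory.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical detach/join bookkeeping as an atomic state machine.
const JOINABLE: u8 = 0; // initial: not detached, still running
const DETACHED: u8 = 1; // detach ran first; the exiter frees
const EXITED: u8 = 2;   // exit ran first; the detacher/joiner frees

struct ThreadState(AtomicU8);

impl ThreadState {
    fn new() -> Self {
        ThreadState(AtomicU8::new(JOINABLE))
    }

    /// Called when detaching the thread. Returns true if the caller
    /// must free the thread's memory (the thread already exited).
    fn detach(&self) -> bool {
        self.0.swap(DETACHED, Ordering::AcqRel) == EXITED
    }

    /// Called by the exiting thread. Returns true if the exiting
    /// thread must free its own memory (it was already detached).
    fn on_exit(&self) -> bool {
        self.0.swap(EXITED, Ordering::AcqRel) == DETACHED
    }
}

fn main() {
    // Exit first, then detach: the detacher frees.
    let s = ThreadState::new();
    assert!(!s.on_exit());
    assert!(s.detach());

    // Detach first, then exit: the exiting thread frees itself,
    // using the stackless munmap-and-exit sequence above.
    let s = ThreadState::new();
    assert!(!s.detach());
    assert!(s.on_exit());
}
```

The atomic swap is what prevents the race where a thread exits at the same moment another thread detaches it: exactly one of the two will observe the other's state and take responsibility for the free.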
It's common for threading libraries to provide synchronization primitives such as mutexes and reader-writer locks.
Mustang's main strategy is to use
parking_lot to implement these, since
parking_lot is a widely-used library, and there's not a lot to add here.
Except there's one place Mustang can't yet use
parking_lot: the Rust global
allocator uses a
Mutex, because it's allocating memory for multiple threads
from a shared heap. But the twist for Mustang is, parking_lot itself
does global allocation.
The allocator tries to acquire a lock, which tries to perform an allocation, which tries to acquire a lock, which tries to perform an allocation...
So what do we do? For now, Mustang has its own Mutex implementation, built
on top of atomics and Linux's
futex. It's nothing fancy, it's unfair,
it's inefficient in a bunch of ways, and it's certainly not proven correct.
But it's pretty simple and it passes all the tests! This is your regular
reminder that simply implementing everything in Rust does not make anything
automatically safer. Mustang is still experimental at this point, with lots
of rough edges.
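To show the shape of such a lock, here's a minimal sketch in the same spirit: an unfair mutex over a single atomic word. Where a real futex-based lock would sleep in the futex syscall on contention and wake a waiter on unlock, this sketch just spins, to keep it portable. It is not Mustang's actual implementation:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Minimal unfair lock: 0 = unlocked, 1 = locked. A futex-based
// version would sleep in futex_wait when contended and call
// futex_wake in unlock; here we spin instead.
pub struct RawMutex {
    state: AtomicU32,
}

impl RawMutex {
    pub const fn new() -> Self {
        RawMutex { state: AtomicU32::new(0) }
    }

    pub fn try_lock(&self) -> bool {
        self.state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn lock(&self) {
        while !self.try_lock() {
            // Real code: futex_wait(&self.state, 1) here instead.
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.state.store(0, Ordering::Release);
        // Real code: futex_wake(&self.state, 1) here.
    }
}

fn main() {
    // Demonstrate mutual exclusion: 4 threads each do 1000
    // read-modify-write cycles on a shared counter under the lock.
    static COUNTER: AtomicU32 = AtomicU32::new(0);
    let m = Arc::new(RawMutex::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..1000 {
                    m.lock();
                    let v = COUNTER.load(Ordering::Relaxed);
                    COUNTER.store(v + 1, Ordering::Relaxed);
                    m.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```

Note that nothing in this lock allocates, which is exactly the property that lets it sit underneath the global allocator without recursing.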
But not cancellation
The other really big piece of functionality that a libpthread would provide is cancellation. Cancellation is very complex, and requires hooks in libc and very special handling of many system calls. However, Rust doesn't support cancellation, and it's relatively rare even in C code, so origin and mustang don't implement it.
The origin crate provides low-level but still somewhat Rust-idiomatic interfaces to process startup and shutdown, and now also threads. This accompanies the rsix crate which provides Rust-idiomatic interfaces to system calls.
The c-scape crate provides libc and libpthread ABIs as wrappers around
rsix and origin. Right now, this allows existing code, such as Rust's
std to run on origin and rsix without any extra porting work.
But also, code that wants to can bypass the c-scape compatibility layer, and call into rsix and origin directly. This eliminates some overhead, but more importantly, it offers greater safety and simplicity, because there are fewer raw pointers, raw file descriptors, and raw error return values.
Of course, even rsix and origin are still very low-level. Most users
should continue to use
std, and high-level libraries such as
cap-std. But perhaps someday
there could be a way of using rsix in std, simplifying the code by factoring
out raw pointers, raw file descriptors, and raw error handling.
And perhaps someday, one could even imagine, an official Rust target for Rust programs built entirely in Rust, with I/O safety down to the syscalls.