sunfishcode's blog
A blog by sunfishcode

Thread support in Mustang

Mustang, a system for running Rust programs written entirely in Rust, has made a lot of progress since the last blog post:

Mustang's thread library is a chance to explore the role of a thread library, the special syscalls and registers that only a thread library uses, and the interaction between a thread library and other system calls. The rest of this blog post takes a closer look.

I'd like to thank Steed's pthread implementation for the initial inspiration here, and for demonstrating how to use clone, futex, and the platform thread register.

So, what does a thread library do?

Create threads

First, what is a thread?

A thread consists of an OS thread, which on Linux is just a special kind of process, and some data: metadata, user TLS data, stack memory, and a stack guard. Creating a thread involves allocating memory for the data, and then creating an OS thread configured to use them.

Mustang's origin crate allocates all the memory for a thread in a single contiguous anonymous mmap allocation. Most platforms have a special "Thread Pointer" register which is used to point to thread-specific data (%fs on x86_64, %gs on x86, tpidr_el0 on aarch64, tp on RISC-V, and so on), and we point it at this allocation. Compiled code uses this register to locate TLS data, and origin uses this register to locate its metadata.

[Figure: Thread Layout]

The specific layout differs between architectures, for example with the metadata being located before or after the user TLS data. And things would be much more involved if we were talking about dynamic linking. But this covers the basics.
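To make the single-allocation idea concrete, here's a minimal sketch of how the offsets inside such a mapping could be computed. The `ThreadLayout` struct, the `compute_layout` function, and the ordering (guard, then stack, then TLS, then metadata) are assumptions for illustration, not origin's actual code, and as noted the real ordering varies by architecture.

```rust
/// Hypothetical description of one thread's contiguous mapping.
/// All offsets are in bytes from the start of the mapping.
#[derive(Debug, PartialEq)]
pub struct ThreadLayout {
    /// Start of the inaccessible guard region (lowest addresses).
    pub guard_offset: usize,
    /// Lowest address of the usable stack.
    pub stack_offset: usize,
    /// One past the highest stack address; the stack grows down from here.
    pub stack_top: usize,
    /// Start of the user TLS data.
    pub tls_offset: usize,
    /// Start of the thread library's own metadata.
    pub metadata_offset: usize,
    /// Total size to request from mmap, rounded up to whole pages.
    pub total_size: usize,
}

/// Round `value` up to a multiple of `align`, which must be a power of two.
fn round_up(value: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (value + align - 1) & !(align - 1)
}

/// Compute where each piece lives inside one contiguous mapping.
pub fn compute_layout(
    page_size: usize,
    guard_size: usize,
    stack_size: usize,
    tls_size: usize,
    tls_align: usize,
    metadata_size: usize,
    metadata_align: usize,
) -> ThreadLayout {
    let guard_offset = 0;
    // The guard and the stack each occupy whole pages so that the guard
    // can be given different memory protections from the stack.
    let stack_offset = round_up(guard_size, page_size);
    let stack_top = stack_offset + round_up(stack_size, page_size);
    let tls_offset = round_up(stack_top, tls_align);
    let metadata_offset = round_up(tls_offset + tls_size, metadata_align);
    let total_size = round_up(metadata_offset + metadata_size, page_size);
    ThreadLayout {
        guard_offset,
        stack_offset,
        stack_top,
        tls_offset,
        metadata_offset,
        total_size,
    }
}
```

In this sketch, the thread pointer register would be set to the mapping's base address plus `metadata_offset`, and the child's initial stack pointer to the base plus `stack_top`.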

Once the data is allocated and initialized, it's time to create the OS thread.

The clone system call

Origin creates an OS thread using the clone system call. Linux considers this to be similar to fork, which creates new processes, but with more options. A new thread is a new execution context, with its own CPU register state, similar to a new process. The main thing that makes it a thread instead of a process is that it happens to use the same virtual address space and other parts as its parent.
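The "same virtual address space and other parts" are selected by flags passed to clone. As a rough sketch, here is the flag combination that thread libraries on Linux conventionally pass. The constant values are the ones from the Linux headers; the `thread_clone_flags` helper is a hypothetical name, and the exact set origin passes is an assumption here.

```rust
// Flag values as defined in the Linux UAPI headers.
pub const CLONE_VM: u32 = 0x0000_0100; // share the virtual address space
pub const CLONE_FS: u32 = 0x0000_0200; // share cwd, root, and umask
pub const CLONE_FILES: u32 = 0x0000_0400; // share the file descriptor table
pub const CLONE_SIGHAND: u32 = 0x0000_0800; // share signal handlers
pub const CLONE_THREAD: u32 = 0x0001_0000; // join the parent's thread group
pub const CLONE_SYSVSEM: u32 = 0x0004_0000; // share System V semaphore undo state
pub const CLONE_SETTLS: u32 = 0x0008_0000; // set the child's thread pointer
pub const CLONE_PARENT_SETTID: u32 = 0x0010_0000; // store the child's tid for the parent
pub const CLONE_CHILD_CLEARTID: u32 = 0x0020_0000; // clear the tid and futex-wake at exit

/// Hypothetical helper: the flags that make a new execution context
/// behave like a thread rather than a separate process.
pub fn thread_clone_flags() -> u32 {
    CLONE_VM
        | CLONE_FS
        | CLONE_FILES
        | CLONE_SIGHAND
        | CLONE_THREAD
        | CLONE_SYSVSEM
        | CLONE_SETTLS
        | CLONE_PARENT_SETTID
        | CLONE_CHILD_CLEARTID
}
```

A fork-like clone would pass none of the sharing flags, which is why the child ends up with its own copy of everything.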

[Figure: Fork vs. Clone]

Like fork, clone is invoked by one execution context, and when it returns there are two execution contexts, one for the parent and one for the child.

Unlike fork though, when the child shares virtual memory with the parent, it can't continue on the same stack, so we pass clone the pointer to the stack we allocated above for the child to run on.

The child starts execution by returning from the clone call, and the return value of the clone call is 0 in the child, and the child's thread id in the parent, so the code after the clone call can test whether it's running in the child or the parent.

The clone system call doesn't have a built-in way of passing any arguments to the child to tell it what to do. To do that, we place additional arguments in registers that we know the system call doesn't clobber, effectively "passing" them to the child.

To illustrate, here's an annotated version of the RISC-V code for this. Other architectures do equivalent things, though e.g. x86 has more ABI details in play.

// Place syscall arguments in a0, a1, ...,
// and `__NR_clone` in a7.
// And place the argument to pass to the child, and the
// function pointer for the child to call, in a5 and a6,
// which are not used by the syscall, and not clobbered by
// it either.

ecall                // Do the `clone` system call.
bnez a0, 0f          // If we're in the parent, branch to 0:

// We're in the child! Move the argument and function
// pointer into calling-convention registers for the call to
// `entry`.
mv a0, a5            // `arg`
mv a1, a6            // `fn_`

// Zero out the frame address and return address. The call
// below will never return.
mv fp, zero
mv ra, zero

// Call into our Rust code which will call the callee,
// passing it the argument. The Rust code will terminate
// the thread and not return here.
tail entry

// We're in the parent! Return from the system call in the
// normal way.
0:

origin's philosophy is to do as little as possible in assembly, so this code just sets the frame pointer and return address to null, to ensure that we never try to return back to our little assembly fragment, moves the arguments into place for the platform calling-convention argument registers, and calls into Rust code, specifically the threads::entry function, to do the rest.

As an aside, this is very similar to the code origin uses for the start of the process, _start, which calls the program::entry function. The OS doesn't start a process in a manner compatible with the platform calling convention, so we use a minimal amount of assembly to set up a call into Rust code, also including setting the return address and frame pointer to null.

In both threads and processes, we always exit by calling the exit (for threads) or exit_group (for the process) system calls, and never by returning to assembly code.

Join threads

When one thread wants to wait for another to exit, it performs a "join".

Once the other thread exits, the joining thread is woken up, and it frees any resources associated with the other thread, and collects the return values (though Rust's std::thread implementation doesn't use thread return values as such, so the return value part isn't implemented in Mustang yet).

In the clone system call that we made in create_thread above, one of the flags we pass is CLONE_CHILD_CLEARTID. This tells Linux to clear the child thread id and wake up any threads waiting on a futex at that memory location when the child exits. It's important that it does both, so that the parent's futex call can avoid waiting if the child has exited first.

If the child hasn't exited yet, join_thread's futex call waits until the child exits and Linux wakes it up. If the thread wasn't detached (more on that below), joining then frees the thread's memory—its stack, TLS data, and metadata.

In the code, this is join_thread, which calls wait_for_thread_exit, and then free_thread_memory.
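The wait-until-cleared protocol can be modeled in ordinary Rust. This is only a sketch: the `futex_wait_model` function stands in for a real `futex(FUTEX_WAIT)` call by yielding instead of sleeping in the kernel, the child clears its own tid word where the kernel would do so for CLONE_CHILD_CLEARTID, and all names ending in `_model` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Stand-in for `futex(FUTEX_WAIT)`. A real futex wait sleeps in the
/// kernel only if `*word` still equals `expected`; this model yields.
fn futex_wait_model(word: &AtomicU32, expected: u32) {
    if word.load(Ordering::Acquire) == expected {
        thread::yield_now();
    }
}

/// Block until the tid word has been cleared to zero.
pub fn wait_for_thread_exit_model(tid: &AtomicU32) {
    loop {
        let observed = tid.load(Ordering::Acquire);
        if observed == 0 {
            // The child has already exited, so there's nothing to wait for.
            return;
        }
        // Wait only while the word still holds the value we observed.
        // If the child exits in between, the wait returns immediately.
        futex_wait_model(tid, observed);
    }
}

/// Spawn a model "child" that clears its tid word on exit, standing in
/// for what the kernel does when CLONE_CHILD_CLEARTID is set.
pub fn spawn_model_child(tid: Arc<AtomicU32>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... the thread's work would happen here ...
        tid.store(0, Ordering::Release);
    })
}
```

Checking the word before waiting, and having the wait itself re-check it, is what lets the joiner skip waiting when the child has exited first.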

Run thread destructors (registered with __cxa_thread_atexit_impl)

Variables declared with the thread_local macro can have Drop implementations, and those drop functions are called on a thread's copy of the data when the thread exits.

__cxa_thread_atexit_impl is the C ABI function to register a cleanup function to call when a thread exits. It simply pushes the function onto a Vec, and when the thread exits, it calls call_thread_dtors to call the functions in the Vec, in reverse order of their registration, to ensure that object initializations and finalizations are nested.
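Here's a minimal sketch of the push-then-run-in-reverse idea. It leans on std's `thread_local!` for the per-thread Vec, which a real thread library couldn't do since it's the thing implementing thread-local destructors, and the names `register_thread_dtor` and `call_thread_dtors_model` are hypothetical.

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread list of cleanup functions, in registration order.
    static DTORS: RefCell<Vec<Box<dyn FnOnce()>>> = RefCell::new(Vec::new());
}

/// Register a cleanup function to run when the current thread exits.
pub fn register_thread_dtor(dtor: Box<dyn FnOnce()>) {
    DTORS.with(|dtors| dtors.borrow_mut().push(dtor));
}

/// Run the registered cleanup functions, most recently registered first.
pub fn call_thread_dtors_model() {
    // Pop one entry at a time so the list isn't borrowed while a
    // destructor runs; a destructor may itself register another one.
    while let Some(dtor) = DTORS.with(|dtors| dtors.borrow_mut().pop()) {
        dtor();
    }
}
```

Popping from the end gives the reverse order for free, and looping until the Vec is empty picks up any destructors registered during the run.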

Detach threads

"Detaching" a thread declares that it will never be "joined" by any other thread, so it needs to release its resources on its own when it exits.

This is a fairly straightforward matter of just doing bookkeeping to keep track of whether a thread is in the detached state, and freeing resources when it exits if it is, but there are a few catches.

One is that we create threads with clone with the CLONE_CHILD_CLEARTID flag, which tells Linux to zero out our tid field when the child exits. That helps join_thread, but for detached threads, it means that if we free our memory before we exit, the kernel will still think it needs to clear the tid field. Something else in the program could reuse our freed memory, and it could get corrupted if Linux clears out what is no longer our tid field. So when a thread is marked detached, we call the set_tid_address system call, passing it a null pointer, which effectively disables CLONE_CHILD_CLEARTID for the thread.

The other is that a thread's memory is freed by an munmap system call, but if we call the usual munmap function that wraps the system call, it will free the current thread's memory, including the stack we're running on, out from underneath us. The munmap wrapper will then access freed memory trying to read its return address for the return jump.

To fix this, we need more assembly code. We need a special code sequence to make a munmap system call, and then make an exit system call to exit the thread, without touching the stack after the munmap. Fortunately, this can be done in just a handful of instructions.

Note that on 32-bit x86, we normally don't make syscalls directly, because the usual int 0x80 mechanism is very slow, and Linux provides a much faster way to make syscalls through the vDSO. However, we can't use that here because the vDSO is effectively another syscall wrapper that expects to store a return address on the stack. So we use int 0x80 so that we can completely prevent touching the stack. Fortunately, threads don't exit all that often, so the performance loss here isn't that important.
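Setting the catches aside, the bookkeeping half of detaching can be sketched as a small atomic state machine. The state names, the `ThreadState` type, and its methods are assumptions for illustration rather than origin's actual code. The point is that "detach" and "exit" can race, and exactly one side must end up responsible for freeing the thread's memory.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const INITIAL: u8 = 0; // running, and still joinable
const DETACHED: u8 = 1; // detached before it exited
const EXITED: u8 = 2; // exited before anyone detached it

/// Hypothetical per-thread state word stored in the thread's metadata.
pub struct ThreadState(AtomicU8);

impl ThreadState {
    pub const fn new() -> Self {
        ThreadState(AtomicU8::new(INITIAL))
    }

    /// Called by whoever detaches the thread. Returns true if the thread
    /// had already exited, in which case the caller must free its memory.
    pub fn detach(&self) -> bool {
        self.0
            .compare_exchange(INITIAL, DETACHED, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
    }

    /// Called by the thread itself as it exits. Returns true if the
    /// thread was detached, in which case it must free its own memory.
    pub fn exit(&self) -> bool {
        self.0
            .compare_exchange(INITIAL, EXITED, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
    }
}
```

Whichever compare-exchange loses the race learns that the other side got there first, so the memory is freed exactly once. When `exit` returns true, the thread is in the situation described above, where it has to unmap its own stack and exit without touching it again.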

Synchronize threads

It's common for threading libraries to provide synchronization primitives such as mutexes and reader-writer locks.

Mustang's main strategy is to use parking_lot to implement these, since parking_lot is a widely-used library, and there's not a lot to add here.

Except there's one place Mustang can't yet use parking_lot: the Rust global allocator uses a Mutex, because it's allocating memory for multiple threads from a shared heap. But the twist for Mustang is, parking_lot's Mutex does global allocation.

Uh oh.

Allocator tries to acquire a lock, which tries to perform an allocation, which tries to acquire a lock, which tries to perform an allocation...


So what do we do? For now, Mustang has its own Mutex implementation, built on top of atomics and Linux's futex. It's nothing fancy, it's unfair, it's inefficient in a bunch of ways, and it's certainly not proven correct. But it's pretty simple and it passes all the tests! This is your regular reminder that simply implementing everything in Rust does not make anything automatically safer. Mustang is still experimental at this point, with lots of unsafe.
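For a sense of what a mutex built on atomics and futex looks like, here's a minimal sketch of the classic three-state design. It isn't Mustang's implementation: `SimpleMutex` and its methods are hypothetical, and the two `futex_*_model` functions are stand-ins that yield instead of making real futex system calls, so that the sketch runs anywhere. Like the real thing, it's unfair and not proven correct.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1; // locked, no waiters known
const CONTENDED: u32 = 2; // locked, and there may be waiters

/// Stand-in for `futex(FUTEX_WAIT)`: sleep only if the word still holds
/// `expected`. This model just yields the CPU.
fn futex_wait_model(word: &AtomicU32, expected: u32) {
    if word.load(Ordering::Relaxed) == expected {
        thread::yield_now();
    }
}

/// Stand-in for `futex(FUTEX_WAKE)`. Nothing to do here, because the
/// model waiters above never actually sleep.
fn futex_wake_model(_word: &AtomicU32) {}

pub struct SimpleMutex<T> {
    state: AtomicU32,
    value: UnsafeCell<T>,
}

// Safety: access to `value` is serialized by `state`.
unsafe impl<T: Send> Sync for SimpleMutex<T> {}

impl<T> SimpleMutex<T> {
    pub const fn new(value: T) -> Self {
        SimpleMutex { state: AtomicU32::new(UNLOCKED), value: UnsafeCell::new(value) }
    }

    fn lock(&self) {
        // Fast path: uncontended acquire, with no allocation and no syscall.
        if self
            .state
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: mark the lock contended and wait until we are the
        // one who swaps it away from UNLOCKED.
        while self.state.swap(CONTENDED, Ordering::Acquire) != UNLOCKED {
            futex_wait_model(&self.state, CONTENDED);
        }
    }

    fn unlock(&self) {
        // Only pay for a wake-up if someone may be waiting.
        if self.state.swap(UNLOCKED, Ordering::Release) == CONTENDED {
            futex_wake_model(&self.state);
        }
    }

    /// Run `f` with exclusive access to the value. If `f` panics, the
    /// lock stays held; a fuller version would use a guard type.
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        self.lock();
        let result = f(unsafe { &mut *self.value.get() });
        self.unlock();
        result
    }
}
```

The property that matters for the allocator is visible here: neither locking nor unlocking ever allocates, so the cycle described above can't form.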

But not cancellation

The other really big piece of functionality that a libpthread would provide is cancellation. Cancellation is very complex, and requires hooks in libc and very special handling of many system calls. However, Rust doesn't support cancellation, and it's relatively rare even in C code, so origin and Mustang don't implement it.

Mustang organization

The origin crate provides low-level but still somewhat Rust-idiomatic interfaces to process startup and shutdown, and now also threads. This accompanies the rustix crate which provides Rust-idiomatic interfaces to system calls.

The c-scape crate provides libc and libpthread ABIs as wrappers around rustix and origin. Right now, this allows existing code, such as Rust's std, to run on origin and rustix without any extra porting work.

But also, code that wants to can bypass the c-scape compatibility layer, and call into rustix and origin directly. This eliminates some overhead, but more importantly, it offers greater safety and simplicity, because there are fewer raw pointers, raw file descriptors, and raw error return values.
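To illustrate what "raw error return values" means in practice: a raw Linux system call reports failure by returning a small negative number, the negated errno, in the same register as a successful result. Here's a minimal sketch of turning that into a Rust `Result` once, at the boundary. The `check_syscall_ret` function is a hypothetical name and isn't rustix's API.

```rust
use std::io;

/// Convert a raw Linux syscall return value into an `io::Result`.
///
/// Linux reserves the values -4095 through -1 for errors, where the
/// value is the negated errno. Anything else is a successful result.
pub fn check_syscall_ret(ret: isize) -> io::Result<usize> {
    if (-4095..0).contains(&ret) {
        Err(io::Error::from_raw_os_error((-ret) as i32))
    } else {
        Ok(ret as usize)
    }
}
```

Code above this boundary can then use `?` and never has to remember which integers mean failure, which is the kind of simplification a Rust-idiomatic layer offers over a C-style one.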

And beyond!

Of course, even rustix and origin are still very low-level. Most users should continue to use std, and high-level libraries such as cap-std. But perhaps someday there could be a way of using rustix in std, simplifying the code by factoring out raw pointers, raw file descriptors, and raw error handling.

And perhaps someday, one could even imagine an official Rust target for Rust programs built entirely in Rust, with I/O safety down to the syscalls.