Writing into uninitialized buffers in Rust
Uninitialized buffers in Rust are a long-standing question, for example:
- https://rust-lang.github.io/rfcs/2930-read-buf.html
- https://doc.rust-lang.org/nightly/unstable-book/library-features/core-io-borrowed-buf.html
- https://blog.yoshuawuyts.com/uninit-read-write/
- https://internals.rust-lang.org/t/reading-into-uninitialized-buffers-yet-again/13282/4
Recently, John Nunley and Alex Saveau came up with an idea for a new
approach, using a `Buffer` trait, which is now in rustix 1.0 and which I'll
describe in this post.
Introducing the `Buffer` trait
The POSIX `read` function reads bytes from a file descriptor into a buffer,
and it can read fewer bytes than requested. Using `Buffer`, `read` in
rustix looks like this:
pub fn read<Fd: AsFd, Buf: Buffer<u8>>(fd: Fd, buf: Buf) -> Result<Buf::Output>
This uses the `Buffer` trait to describe the buffer argument. The `Buffer`
trait looks like this:
pub trait Buffer<T> {
    /// The type of the value returned by functions with `Buffer` arguments.
    type Output;
    /// Return a raw pointer and length to the underlying buffer.
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// Assert that `len` elements were written to, and provide a return value.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}
(And thanks to Yoshua Wuyts for feedback on this trait and encouragement for the overall idea!)
(Rustix's own `Buffer` trait is sealed and its functions are private. That's
just rustix choosing, for now, to reserve the ability to evolve the trait
without breaking compatibility, at the expense of not allowing users to use
`Buffer` to define their own I/O functions.)
`Buffer` is implemented for `&mut [T]`, so users can pass `read` a `&mut [u8]`
buffer to write into, and it'll return a `Result<usize>`, where the `usize`
indicates how many bytes were actually read, on success. This matches how
`read` in rustix used to work. Using this looks like:
let mut buf = [0_u8; 16];
let num_read = read(fd, &mut buf)?;
process(&buf[..num_read]); // `process` stands in for whatever consumes the data
`Buffer` is also implemented for `&mut [MaybeUninit<T>]`, so users can pass
`read` a `&mut [MaybeUninit<u8>]`, and in that case, they'll get back a
`Result<(&mut [u8], &mut [MaybeUninit<u8>])>`. On success, that provides a pair
of slices which are subslices of the original buffer: the range of bytes that
data was read into, and the rest of the buffer, which remains uninitialized.
Rustix previously had a function called `read_uninit` that worked this way, and
in rustix 1.0 it's replaced by this new `Buffer`-enabled `read` function. Using
this looks like:
let mut buf = [MaybeUninit::<u8>::uninit(); 16];
let (init, uninit) = read(fd, &mut buf)?;
process(init); // `process` stands in for whatever consumes the data
This allows reading into uninitialized buffers with a safe API.
`Buffer` also supports a way to read into the spare capacity of a `Vec`.
The `spare_capacity` function takes a `&mut Vec<T>` and returns a
`SpareCapacity` newtype which implements `Buffer`, and it automatically
sets the length of the vector to include the number of initialized elements
after the `read`, encapsulating the unsafety of `Vec::set_len`. Using this
looks like:
let mut buf = Vec::<u8>::with_capacity(1024);
let num_read = read(fd, spare_capacity(&mut buf))?;
process(&buf); // `process` stands in for whatever consumes the data
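To make the `Vec` case concrete, here is a minimal sketch of how a `SpareCapacity`-style newtype could implement the trait. It is an illustration under assumptions, not rustix's actual code: the `Buffer` trait is a local copy (rustix's is sealed), `SpareCapacity` and `spare_capacity` are re-created by hand, and `fill_from` is a hypothetical stand-in for a syscall wrapper such as `read`.

```rust
/// Local stand-in for the `Buffer` trait described in this post.
pub trait Buffer<T> {
    type Output;
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// # Safety
    /// The first `len` elements of the buffer must have been initialized.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

/// Hypothetical re-creation of the newtype; rustix's real one may differ.
pub struct SpareCapacity<'a, T>(&'a mut Vec<T>);

pub fn spare_capacity<T>(v: &mut Vec<T>) -> SpareCapacity<'_, T> {
    SpareCapacity(v)
}

impl<'a, T> Buffer<T> for SpareCapacity<'a, T> {
    type Output = usize;

    fn parts_mut(&mut self) -> (*mut T, usize) {
        // Only the allocated-but-unused tail of the vector is exposed.
        let spare = self.0.spare_capacity_mut();
        (spare.as_mut_ptr().cast::<T>(), spare.len())
    }

    unsafe fn assume_init(self, len: usize) -> usize {
        let SpareCapacity(vec) = self;
        let new_len = vec.len() + len;
        // SAFETY: the caller promises that `len` elements past the old
        // length were written, and `len` never exceeds the spare capacity.
        unsafe { vec.set_len(new_len) };
        len
    }
}

/// Hypothetical stand-in for a syscall: copies as much of `src` as fits.
pub fn fill_from<B: Buffer<u8>>(src: &[u8], mut buf: B) -> B::Output {
    let (ptr, cap) = buf.parts_mut();
    let n = src.len().min(cap);
    // SAFETY: `ptr` is valid for `cap` bytes of writes and `n <= cap`.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), ptr, n);
        buf.assume_init(n)
    }
}
```

With this sketch, calling `fill_from(b"cdef", spare_capacity(&mut v))` on a vector that already holds `b"ab"` appends after the existing elements, and the only `set_len` call lives inside `assume_init`.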
In rustix, all functions that previously took `&mut [u8]` buffers to write into
now take `impl Buffer<u8>` buffers, so they support writing into uninitialized
buffers.
Under the covers
`read` is implemented like this:
let len = unsafe { backend::io::syscalls::read(fd.as_fd(), buf.parts_mut())? };
unsafe { Ok(buf.assume_init(len)) }
First we call the underlying system call, and it returns the number of bytes it
read. We then pass that to `assume_init`, which computes the `Buffer::Output` to
return. The output may be just that number, or may be a pair of slices reflecting
that number.
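The same two-step shape can be tried out without rustix. The following is a sketch under assumptions: the trait is a local copy of the one shown earlier, the two slice implementations are written by hand and may differ from rustix's, and `fill_from` is a hypothetical function that plays the role of the system call by copying bytes from a source slice.

```rust
use std::mem::MaybeUninit;

/// Local stand-in for the `Buffer` trait described in this post.
pub trait Buffer<T> {
    type Output;
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// # Safety
    /// The first `len` elements of the buffer must have been initialized.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

impl<T> Buffer<T> for &mut [T] {
    /// Already-initialized slices just report the element count.
    type Output = usize;

    fn parts_mut(&mut self) -> (*mut T, usize) {
        (self.as_mut_ptr(), self.len())
    }

    unsafe fn assume_init(self, len: usize) -> usize {
        len
    }
}

impl<'a, T> Buffer<T> for &'a mut [MaybeUninit<T>] {
    /// Uninitialized slices are split into (initialized, still uninitialized).
    type Output = (&'a mut [T], &'a mut [MaybeUninit<T>]);

    fn parts_mut(&mut self) -> (*mut T, usize) {
        (self.as_mut_ptr().cast::<T>(), self.len())
    }

    unsafe fn assume_init(self, len: usize) -> Self::Output {
        let (init, uninit) = self.split_at_mut(len);
        // SAFETY: the caller promises the first `len` elements were written.
        let init = unsafe {
            std::slice::from_raw_parts_mut(init.as_mut_ptr().cast::<T>(), init.len())
        };
        (init, uninit)
    }
}

/// Hypothetical stand-in for a syscall: copies as much of `src` as fits.
pub fn fill_from<B: Buffer<u8>>(src: &[u8], mut buf: B) -> B::Output {
    // Step 1: hand the raw parts to the "system call".
    let (ptr, cap) = buf.parts_mut();
    let n = src.len().min(cap);
    // SAFETY: `ptr` is valid for `cap` bytes of writes and `n <= cap`.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), ptr, n);
        // Step 2: let the buffer turn the count into its own output type.
        buf.assume_init(n)
    }
}
```

Passing `&mut a[..]` for an initialized array yields a plain count, while passing a slice of `MaybeUninit<u8>` yields the pair of subslices, even though `fill_from` itself is written only once.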
What if `T` is not `u8`?
`Buffer` uses a type parameter `T` rather than hard-coding `u8`, so that it can
be used by functions like `epoll::wait`, `kevent`, and `port::get` to return
event records instead of bytes. Using this can look like this:
let mut event_list = Vec::<epoll::Event>::with_capacity(16);
loop {
    let _num = epoll::wait(&epoll, spare_capacity(&mut event_list), None)?;
    for event in event_list.drain(..) {
        handle(event);
    }
}
This drains the `Vec` with `drain` so that it's empty before each `wait`,
because `spare_capacity` appends to the `Vec` rather than overwriting any
elements. There are no dynamic allocations inside the loop; `SpareCapacity`
only uses the existing spare capacity and only calls `set_len`, and not
`resize`.
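The no-allocation claim can be checked with the standard library alone. This is a sketch under assumptions: `fake_wait` is a hypothetical stand-in for `epoll::wait` that appends plain `u32` "events" into the vector's spare capacity, and `run` is an illustrative driver, neither of which exists in rustix.

```rust
/// Hypothetical stand-in for `epoll::wait`: appends pending events into the
/// spare capacity of `events` and returns how many were appended.
fn fake_wait(events: &mut Vec<u32>, pending: &mut impl Iterator<Item = u32>) -> usize {
    let mut n = 0;
    for slot in events.spare_capacity_mut() {
        match pending.next() {
            Some(ev) => {
                slot.write(ev);
                n += 1;
            }
            None => break,
        }
    }
    // SAFETY: the first `n` spare slots were initialized just above.
    unsafe { events.set_len(events.len() + n) };
    n
}

/// Runs the drain-and-refill loop over `total` events. Returns the number of
/// events handled, their sum, and whether the vector kept its allocation.
pub fn run(total: u32) -> (u32, u64, bool) {
    let mut events = Vec::<u32>::with_capacity(4);
    let start_ptr = events.as_ptr();
    let start_cap = events.capacity();
    let mut pending = 0..total;
    let (mut count, mut sum) = (0u32, 0u64);
    loop {
        if fake_wait(&mut events, &mut pending) == 0 {
            break;
        }
        // Draining leaves the length at zero but keeps the capacity.
        for ev in events.drain(..) {
            count += 1;
            sum += u64::from(ev);
        }
    }
    let stable = events.as_ptr() == start_ptr && events.capacity() == start_cap;
    (count, sum, stable)
}
```

Because `drain(..)` resets the length without releasing the allocation, every iteration refills the same memory, which is why the pointer and capacity compare equal at the end.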
Alternatively, because `Buffer` also works on slices, this code can be written
without using `Vec` at all:
let mut event_list = [MaybeUninit::<epoll::Event>::uninit(); 16];
loop {
    let (init, _uninit) = epoll::wait(&epoll, &mut event_list, None)?;
    for event in init {
        handle(event);
    }
}
Error messages
One downside of the `Buffer` trait approach is that it sometimes evokes error
messages from rustc which aren't obvious. This happened often enough that
rustix's documentation now has a section about them, along with an example
showing cases where they come up.
Using `Buffer` safely
Rust's std currently contains an experimental API based on `BorrowedBuf`, which
has the nice property of allowing users to use it without using `unsafe`, and
without doing anything hugely inefficient, such as initializing the full buffer.
To achieve this, `BorrowedBuf` uses a "double cursor" design to avoid
re-initializing memory that has already been initialized.
The `Buffer` trait described here is simpler, avoiding the need for a "double
cursor". However, it does have an `unsafe` required method. Is there a way we
could modify it to support safe use?
A `Cursor` API like `BorrowedCursor` could do it. That supports safely and
incrementally writing into an uninitialized buffer. And a key feature of
`BorrowedCursor` is that it never requires the full buffer to be eagerly
initialized.
With that, the `Buffer` trait might look like:
pub trait Buffer<T> {
    // ... existing contents
    /// An alternative to `parts_mut` for use with `init`.
    ///
    /// Return a `Cursor`.
    fn cursor(&mut self) -> Cursor<T> {
        // SAFETY: `parts_mut` guarantees that the pointer and length are valid.
        unsafe {
            Cursor::new(self.parts_mut())
        }
    }
    /// A safe alternative to `assume_init`.
    ///
    /// Assert that `done.written()` elements were written to, and provide a
    /// return value.
    fn init(self, done: Cursor<T>) -> Self::Output
    where
        Self: Sized,
    {
        // SAFETY: `Cursor` ensures that exactly `written()` elements have
        // been written.
        unsafe {
            self.assume_init(done.written())
        }
    }
}
This way, a user could write their own functions that take `Buffer` arguments
and implement them using `cursor` and `init`, without using `unsafe`.
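What such a `Cursor` could look like is left open above. One possible shape is sketched below, purely as an assumption: the struct layout, `from_uninit`, `push`, and `written` are all hypothetical names chosen for illustration, and `write_greeting` is a made-up user function showing that the writing side needs no `unsafe`.

```rust
use std::marker::PhantomData;
use std::mem::MaybeUninit;

/// Hypothetical `Cursor`: an append-only window over possibly uninitialized
/// memory that counts how many elements have been written.
pub struct Cursor<'a, T> {
    ptr: *mut T,
    cap: usize,
    written: usize,
    _marker: PhantomData<&'a mut [MaybeUninit<T>]>,
}

impl<'a, T> Cursor<'a, T> {
    /// # Safety
    /// `ptr` must be valid for writes of `cap` elements for the lifetime
    /// `'a`, and must not be accessed through other pointers meanwhile.
    pub unsafe fn new((ptr, cap): (*mut T, usize)) -> Self {
        Cursor { ptr, cap, written: 0, _marker: PhantomData }
    }

    /// Safe constructor from a slice, for trying the cursor on its own.
    pub fn from_uninit(buf: &'a mut [MaybeUninit<T>]) -> Self {
        // SAFETY: the exclusive borrow guarantees validity for `'a`.
        unsafe { Self::new((buf.as_mut_ptr().cast::<T>(), buf.len())) }
    }

    /// Appends one element, or hands it back if the buffer is full.
    pub fn push(&mut self, value: T) -> Result<(), T> {
        if self.written == self.cap {
            return Err(value);
        }
        // SAFETY: `written < cap`, so the slot is in bounds. `write` does
        // not read or drop whatever was in the slot before.
        unsafe { self.ptr.add(self.written).write(value) };
        self.written += 1;
        Ok(())
    }

    /// Number of elements written so far; always at most the capacity.
    pub fn written(&self) -> usize {
        self.written
    }
}

/// Example user code: fills a buffer without any `unsafe` of its own.
pub fn write_greeting(buf: &mut [MaybeUninit<u8>]) -> usize {
    let mut cursor = Cursor::from_uninit(buf);
    for &byte in b"hi!" {
        if cursor.push(byte).is_err() {
            break;
        }
    }
    cursor.written()
}
```

The invariant that makes a safe `init` plausible is that `written` only ever increases through `push`, so it can never exceed the number of slots that were actually initialized.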
Why `parts_mut` and a raw pointer?
The `parts_mut` function in the `Buffer` trait looks like this:
fn parts_mut(&mut self) -> (*mut T, usize);
Why return a raw pointer and length, instead of a `&mut [MaybeUninit<T>]`?
Because a `&mut [MaybeUninit<T>]` would be unsound in a subtle way. We implement
`Buffer` for `&mut [T]`, which cannot contain any uninitialized elements, and
exposing it as a `&mut [MaybeUninit<T>]` would allow uninitialized elements to
be written into it.
With a raw pointer, we put the burden on the caller of `assume_init` to
guarantee that the buffer has been written to properly.
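The hazard can be illustrated with a short sketch. Everything here is hypothetical and for illustration only: `as_uninit_mut` is an invented helper showing the kind of view `parts_mut` deliberately avoids returning, and the line that would actually trigger undefined behavior is left as a comment rather than executed.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper, not part of rustix: views an initialized slice as
/// possibly uninitialized. The layouts match, so the cast itself is fine;
/// the danger lies in what safe code can then do with the result.
///
/// # Safety
/// The caller must ensure only initialized values are written to the view.
unsafe fn as_uninit_mut<T>(slice: &mut [T]) -> &mut [MaybeUninit<T>] {
    let len = slice.len();
    // SAFETY: `MaybeUninit<T>` has the same layout as `T`.
    unsafe { std::slice::from_raw_parts_mut(slice.as_mut_ptr().cast::<MaybeUninit<T>>(), len) }
}

/// A well-behaved writer: stores only initialized values.
fn well_behaved_writer(view: &mut [MaybeUninit<u8>]) {
    for slot in view.iter_mut() {
        slot.write(7);
    }
}

pub fn demo() -> [u8; 4] {
    let mut buf = [0u8; 4];
    // SAFETY: `well_behaved_writer` only writes initialized bytes.
    let view = unsafe { as_uninit_mut(&mut buf) };
    well_behaved_writer(view);
    // A different writer, using no `unsafe` at all, could instead do:
    //     view[0] = MaybeUninit::uninit();
    // After that, reading `buf[0]` as a `u8` would be undefined behavior,
    // even though `buf` is an ordinary initialized array.
    buf
}
```

If a trait method returned such a view from safe code, nothing would stop the commented-out assignment, whereas a raw pointer can only be written through inside an `unsafe` block whose author takes on that responsibility.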
Looking forward
A limited version of this `Buffer` trait is now in rustix 1.0, so we'll
see how it goes in practice.
If it works out well, I think this `Buffer` design is worth considering for
Rust's std, as a replacement for `BorrowedBuf` (which is currently unstable).
It's simpler, as it avoids the "double cursor" pattern, and it has the fun
feature of supporting the `Vec` spare capacity use case and encapsulating
the unsafe `Vec::set_len` call.