Bridging between source languages, in Wasm

Posted on May 23, 2024

One of the core design goals for Wasm is to support code compiled from many different programming languages.

To this end, core Wasm's type system is very low-level. It's designed in view of the fact that programming languages all have their own ways of doing things. Even seemingly simple things like strings or dictionaries can have different semantics or performance tradeoffs between different languages. Instead of imposing one answer on every language, Wasm gives languages the flexibility to make their own choices.

That also means that whenever we have a Wasm program written in one language, and we want it to be able to talk to another Wasm program, or a host, written in another language, we need a way to bridge between those different languages.

What are the options?

Looking outside of Wasm at how other systems provide for cross-language interfaces, there are roughly three different categories of approaches:

Point-to-point
Language-family
All-to-all

Point-to-point

Point-to-point means connecting one specific language to another specific language. For example, PyO3 is a system for connecting Rust and Python. With PyO3, one might write Rust code that talks to Python code like this:

use pyo3::prelude::*;

/// Formats the sum of two numbers as string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

In point-to-point systems, at least one of the two sides knows what language it's talking to. With PyO3, it's Rust code that knows it's talking to Python code. That Rust code wouldn't be able to talk to any other language. The upside of that tradeoff is that by being specialized to specific languages, point-to-point systems can provide rich integration between languages. PyO3 can make much of the expressivity of Python available to the Rust code, and it can do so efficiently.

But the downside from a Wasm perspective is that we specifically don't want interfaces where one side knows the source language of the other. The source language of the guest shouldn't have to be aware of the source language of the host.

Language-family

Language-family systems connect a family of languages that all share something in common with each other. For example, on the Web, many different JavaScript-like languages can talk to each other by having both sides pretend to be JavaScript. On Unix-like operating systems, many different C-like languages can talk to each other by having both sides pretend to be C. And so on.

Language-family systems tend to provide somewhat less rich integration than point-to-point systems. You can't access every feature of Python if you don't know that you're talking to Python, for example. Any unique feature of any language that isn't common to the family as a whole tends to be hidden.

For example, ClojureScript code can talk to any JavaScript-family language by pretending to be JavaScript. JavaScript doesn't have the same types as ClojureScript, so ClojureScript provides a function named clj->js to convert ClojureScript values to JavaScript values, and it's lossy:

(clj->js [:red "green" 'blue])
;;=> #js ["red" "green" "blue"]

This uses ClojureScript keywords (like :red) and symbols (like 'blue), neither of which has a direct JavaScript counterpart. When they get converted to JavaScript, they need to be converted into something JavaScript does have, such as strings. This allows ClojureScript programs to talk to any other language that thinks it's talking to JavaScript. But it means that ClojureScript can't expect other languages to pass symbols back to it. There's less integration.
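To make the loss concrete, here is a minimal sketch in Rust. The CljValue and JsValue types and the to_js and from_js functions are hypothetical stand-ins invented for this illustration, not part of ClojureScript or any real binding library. The sketch only models the three kinds of values in the example above.

```rust
// Hypothetical model of a few ClojureScript value kinds (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum CljValue {
    Keyword(String), // e.g. :red
    Str(String),     // e.g. "green"
    Symbol(String),  // e.g. 'blue
}

// Hypothetical model of what the shared "JavaScript" side can hold here.
#[derive(Debug, Clone, PartialEq)]
enum JsValue {
    Str(String),
}

/// Lossy conversion in the spirit of clj->js: every kind collapses to a string.
fn to_js(v: &CljValue) -> JsValue {
    match v {
        CljValue::Keyword(s) | CljValue::Str(s) | CljValue::Symbol(s) => {
            JsValue::Str(s.clone())
        }
    }
}

/// Best-effort conversion back. The original kind is gone, so the only
/// thing we can produce is a plain string.
fn from_js(v: &JsValue) -> CljValue {
    match v {
        JsValue::Str(s) => CljValue::Str(s.clone()),
    }
}
```

Converting CljValue::Symbol("blue") with to_js and then back with from_js yields CljValue::Str("blue"): the text survives, but the fact that it was a symbol does not.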

And at the same time, languages that aren't in the language family can't fully participate.

All-to-all

All-to-all systems are designed to connect any language to any other language. Each side is completely unaware of the language of the other side. Often this is done using an Interface Description Language (IDL), so that the interface between languages can be described in a language-independent way.

All-to-all systems are common in RPC protocols, such as network protocols, where it's especially desirable to be able to implement clients and servers in different languages.

IDLs also have a lot of similarity with database schema languages. Both need to define datatypes, and both typically have a strong need to keep the data independent of the programming languages that will produce or consume the data. And, both have a need for the data to be meaningful without existing within a particular address space or a GC heap with an arbitrary reference graph.

What Would Wasm Want?

First of all, no single cross-language interface system is best for all situations; each of these three categories has its tradeoffs. All three should, and can, coexist within the Wasm ecosystem. C developers can use .a archives containing C code that links to other C code via C ABIs. As Wasm GC matures, perhaps there will be conventional ways to link Java libraries compiled to Wasm with other Java libraries, allowing any language in the Java language family to be linked together with Java-level integration.

At the same time, there is a need for an over-arching all-to-all system. Point-to-point and language-family approaches can be nested inside of the over-arching system to provide greater integration where needed. The all-to-all approach is the only way to ensure that every language can participate in the overall ecosystem, without having to be a member of the right language family, and without having a blessed language that every language has to pretend to be.

This is one of the unique opportunities for Wasm. In contrast, Unix, the JVM, the CLR, and JavaScript are all language-family platforms. They each start with their respective blessed language, and oblige all other languages to talk to each other by pretending to be that blessed language.

Wasm, on the other hand, doesn't have an inherent blessed language.

Ok wait. Wait just a minute. Everyone knows that C is the blessed language

The Wasm MVP explicitly focused on C/C++. Many people's first introduction to Wasm was filled with pointers and offsets and struct layouts. It gave a lot of people the impression that Wasm was settling into its place within the grand Unix tradition of using C as its blessed language of communication. Not everyone may be happy about that, but a lot of people aren't surprised by it. If you had asked me before I started working on Wasm where it would go, C ABIs are what I would have guessed it would use.

And in many places in the computing world today, this is seen as inevitable, because C is seen as the universal language. After all, some people observe, all information on a computer is ultimately just bytes, and C pointers can point to any bytes in memory, and that means C can talk to anything, in a way that most other languages cannot. That makes C uniquely suited to be the universal glue between all languages.

Except that it isn't.

😱

There are several wrinkles in that story when it comes to Wasm. One small wrinkle is memory64, which is linear memory with 64-bit pointers. So we don't just have one C ABI; we have at least two, because linear-memory addresses can be either 32-bit or 64-bit, and programs compiled for one can't directly interoperate with programs compiled for the other, even if they're both pretending to be C.
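A rough sketch of why the two don't mix: the same C struct gets a different layout depending on pointer width. The Field enum and layout function below are hypothetical helpers written for this post, and the sketch assumes natural alignment, with pointers taking 4 bytes under 32-bit linear memory and 8 bytes under memory64.

```rust
/// Field kinds for a hypothetical C struct; just enough for this sketch.
#[derive(Clone, Copy)]
enum Field {
    Pointer,
    U32,
}

/// Size (and, by assumption, natural alignment) of a field in bytes.
fn field_size(f: Field, pointer_bytes: usize) -> usize {
    match f {
        Field::Pointer => pointer_bytes,
        Field::U32 => 4,
    }
}

/// Returns (offset of each field, total struct size), assuming each field
/// is aligned to its own size and the struct to its largest field.
fn layout(fields: &[Field], pointer_bytes: usize) -> (Vec<usize>, usize) {
    let mut offsets = Vec::new();
    let mut offset: usize = 0;
    let mut max_align: usize = 1;
    for &f in fields {
        let size = field_size(f, pointer_bytes);
        // Round the running offset up to this field's alignment.
        offset = (offset + size - 1) / size * size;
        offsets.push(offset);
        offset += size;
        max_align = max_align.max(size);
    }
    // Pad the end of the struct to its overall alignment.
    let total = (offset + max_align - 1) / max_align * max_align;
    (offsets, total)
}
```

For a struct like { uint32_t len; char *data; uint32_t cap; }, layout with 4-byte pointers gives offsets 0, 4, 8 and a size of 12, while layout with 8-byte pointers gives offsets 0, 8, 16 and a size of 24. Code compiled for one layout would read the wrong bytes from a struct written under the other.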

But we also have a big wrinkle: Wasm GC.

This whole idea about how all data is just bytes doesn't work in Wasm.

Wasm GC values are not accessible as just bytes. Wasm GC types can't be pointed to by C pointers. This means that C is not the fundamentally universal language on Wasm in the way that it can be within the realm of Unix processes.

And even beyond that, there are more wrinkles, such as data lifetimes. Nothing prevents a C pointer from dangling once the data it points to is deallocated. For decades, C got away with saying that Undefined Behavior was unavoidable, but today, many users are demanding different answers.

Clearly GC is the answer

As obvious as it seems in some circles that C ABIs are the answer, it is equally obvious in other circles that some form of Wasm GC-based ABI is the answer.

After all, if you look at the JVM, the CLR, or JavaScript, which are all very popular platforms that run wide varieties of programming languages, each platform provides a set of GC types that everyone on that platform just uses. This is simple, efficient, and proven. And unlike C, it doesn't have scary memory safety hazards. So it might seem to be the obvious answer for Wasm.

Except, there are wrinkles with that approach too.

😱

One is that even though C won't be the blessed language, linear-memory languages still do matter, and they can't easily interop with GC types. GC types can't easily point to linear memory, and linear-memory languages can't easily hold GC references, so GC types aren't universal either.

Another is that, in the spirit of Wasm as a whole, Wasm GC is being designed to be as language-independent as possible, and this has led it away from attempting to provide one-size-fits-all opinionated GC types for "string", "list", "dictionary", and so on. One person's list is a growable array, while another's is a cons list, and another's is a rope. In practice, programming languages have very different needs, and Wasm instead aims to provide primitive constructs that programming languages can use to build higher-level types.
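As a small illustration of how far apart two "lists" can be, here is the same logical sequence held as a growable array (Rust's Vec) and as a cons list. The ConsList type and its methods are hypothetical, written only for this sketch.

```rust
/// A cons list, roughly as a Lisp-family language might represent a list.
#[derive(Debug)]
enum ConsList {
    Nil,
    Cons(i32, Box<ConsList>),
}

impl ConsList {
    /// Builds a cons list from a slice, preserving element order.
    fn from_slice(items: &[i32]) -> ConsList {
        items
            .iter()
            .rev()
            .fold(ConsList::Nil, |tail, &head| ConsList::Cons(head, Box::new(tail)))
    }

    /// Walks the cells one by one and collects the logical contents.
    fn to_vec(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut cur = self;
        while let ConsList::Cons(head, tail) = cur {
            out.push(*head);
            cur = &**tail;
        }
        out
    }
}
```

ConsList::from_slice(&[1, 2, 3]).to_vec() gives back the same elements as vec![1, 2, 3], so the two agree on the logical value. They disagree on everything else: the array is one contiguous allocation with constant-time indexing, while the cons list is a chain of separately allocated cells with cheap prepending. A platform-level "list" type would have to pick one.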

Yet another is that when people are using systems that don't need a GC, they often don't want to have to use a GC. If we make GC be the universal glue, then we add GC dependencies in places that don't want them.

Do re mi, RPC

Wasm isn't the first place in computing to have a need to connect different languages without having a single obvious blessed language. Networking protocols in particular are an area where no single language took hold, in part because most languages' type systems have things like pointers to mutable data, which is awkward to share over a network. Popular network protocols have often turned to IDLs, such as OpenAPI, Protobufs, or others, which make them all-to-all systems.

Should Wasm use one of these existing RPC-based cross-language systems?

Just as language-family cross-language systems have a place in the Wasm ecosystem, RPC systems do too. And just as before, there is also an over-arching need in Wasm for a common system.

When we scale up software systems, they tend to become distributed systems, so using an RPC protocol is tempting, as it would mean we'd be ready to go distributed, out of the box. On the other hand though, one of the lessons from CORBA is that making everything network-aware makes everything harder.

And, encoding calls into bytes and decoding them on the callee side has overhead, and it's overhead that would be difficult to optimize away in the case where we have two components running on the same computer.
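To give a feel for that overhead, here is a minimal sketch of a call that goes through bytes. The wire format (two little-endian 32-bit integers) and the function names are made up for this illustration and don't correspond to any real RPC protocol.

```rust
/// The function we actually want to call.
fn add(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// Caller side of a hypothetical RPC-style boundary: serialize the
/// arguments into a freshly allocated byte buffer.
fn encode_call(a: u32, b: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8);
    buf.extend_from_slice(&a.to_le_bytes());
    buf.extend_from_slice(&b.to_le_bytes());
    buf
}

/// Callee side: validate the buffer, decode the arguments, then call.
fn decode_and_call(bytes: &[u8]) -> Option<u32> {
    if bytes.len() != 8 {
        return None; // malformed message
    }
    let a = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let b = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Some(add(a, b))
}
```

A direct call is just add(2, 40). The byte-oriented path, decode_and_call(&encode_call(2, 40)), reaches the same function but allocates a buffer, copies the arguments in, checks the length, and copies them back out, and the callee has to handle malformed input that a direct call could never produce.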

So what we'd ideally want is a system that uses an IDL to achieve the same kind of all-to-all cross-language properties that RPC systems have, but which isn't tied to either bytestream serialization or network awareness.

The Wasm component model

The Wasm component model is an all-to-all cross-language system. It has an IDL, and connects languages to each other without either side being aware of the other. And it isn't tied to bytestream serialization or network awareness.

There's a lot in the component model, but to get a taste of how it works, consider a type like string. There is no string type in core Wasm, so string is just a type in the interface type system. That means it doesn't have a fixed representation or even a fixed set of operations. It's just a set of logical values, which for string is the set of all sequences of Unicode Scalar Values.
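As a quick sketch of what "sequence of Unicode Scalar Values" means, Rust's char type happens to model exactly that set: every code point from 0 through 0x10FFFF except the surrogate range. The two helper functions below are hypothetical names for this illustration; the standard-library calls they use are real.

```rust
/// True if `code_point` is a Unicode Scalar Value, i.e. a code point
/// in 0..=0x10FFFF that is not a surrogate (0xD800..=0xDFFF).
fn is_scalar_value(code_point: u32) -> bool {
    std::char::from_u32(code_point).is_some()
}

/// The logical value of a string: its scalar values in order,
/// independent of how the bytes are laid out in memory.
fn logical_value(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

Here is_scalar_value(0x41) is true and is_scalar_value(0xD800) is false, because a lone surrogate is an artifact of UTF-16 encoding rather than a character. logical_value describes a string without saying anything about UTF-8, UTF-16, or any other representation.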

Bindings for individual languages work by encoding descriptions of how the Unicode Scalar Values are represented within their languages. This avoids either side of an interface knowing how the other side represents its values. And, it provides enough information to linkers to insert whatever adaptation code is needed:

If Wasm code is passing a string to the host, the host can just read the string data straight from the Wasm code's memory. No copying is needed in many cases!
If Wasm code using UTF-8 strings is passing a string to Wasm code using UTF-16 strings, the linking process can transparently insert UTF-8 to UTF-16 transcoding between them, so that strings can be passed without either side knowing the encoding of the other side.
If Wasm code is passing a string to Wasm code using the same encoding, the data can be copied. A copy may sound expensive to some ears, but keep in mind that this doesn't happen between component and host, it only happens between two components. And in today's C-like ABIs, there isn't a way to link two modules at all, so this isn't a regression of anything. And in GC land, there are ideas for how even this copy could get optimized away in the future.
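The component-to-component cases above can be sketched as a single adapter function. This is only an illustration of the idea, not the component model's actual adapter code: the Encoding and Encoded types and the adapt function are hypothetical, and a real adapter would likely transcode directly rather than building an intermediate String as this sketch does.

```rust
/// String encodings a component might use internally (illustrative subset).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Utf8,
    Utf16,
}

/// A string as it sits in one component's memory.
#[derive(Debug, PartialEq)]
enum Encoded {
    Utf8(Vec<u8>),
    Utf16(Vec<u16>),
}

/// Hypothetical adapter inserted between two components. Neither side
/// sees the other's encoding. Returns None if the source data is not a
/// valid string in its declared encoding.
fn adapt(src: &Encoded, dst: Encoding) -> Option<Encoded> {
    // Lift: recover the logical value (a sequence of scalar values).
    let logical: String = match src {
        Encoded::Utf8(bytes) => std::str::from_utf8(bytes).ok()?.to_owned(),
        Encoded::Utf16(units) => String::from_utf16(units).ok()?,
    };
    // Lower: write the logical value in the destination's encoding.
    Some(match dst {
        Encoding::Utf8 => Encoded::Utf8(logical.into_bytes()),
        Encoding::Utf16 => Encoded::Utf16(logical.encode_utf16().collect()),
    })
}
```

When source and destination encodings match, adapt amounts to a validated copy; when they differ, it transcodes. Either way, the caller only ever produces its own encoding and the callee only ever receives its own.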

And because there is no serialization, and no object request brokers or network awareness baked in, as compilers continue to optimize, component-model interfaces will be able to be inlined, because everything compilers need to do inlining is exposed up front.

So there's a lot more to it than this, but hopefully this gives a taste of how the system works.

For more information about using the component model, see the component model documentation.

Wrap up

There's a lot more to the component model, such as how the anticipated async support avoids the function "coloring" problem, though see here for a preview. This blog post is just about the cross-language aspects of the design.

Wasm needs an over-arching cross-language interface system if it's to avoid long-term language-based fragmentation. The component model isn't what people expecting plain C ABIs will anticipate, and it isn't what people expecting plain GC types will anticipate either, but it has the properties that a unified ecosystem needs.