This post proposes and explores a design principle for components in complex software systems:
The ideas in this post aren't new; they come from papers and blog posts such as Robust Composition, Capabilities: Effects for Free, Parse, don't validate, the nanoprocess model, and the design choices in the Wasm component model, which itself incorporates ideas from Erlang, OCaml, Rust, COM, and many others. This post is an attempt to articulate what I see as one of the themes that runs through all of these. It's not any kind of official position, and it certainly won't be the last word on any of these topics.
I'd also be very interested in feedback on what makes sense here, what doesn't, and what's missing, whether by Mastodon, Zulip, or email.
This is motivated by WASI and Wasm components, but the core ideas generalize to any complex software system, including those using libraries, daemons, containers, VMs, microservices, or a combination.
One of the main ways we can make complex systems manageable is to make them modular. This means being able to add, remove, change, or understand individual components (or whatever a system is composed of) without needing to consider the system around them.
It's common for systems with large numbers of components to have problems with unexpected interactions between components. A common response to these kinds of problems is to impose a level of modularity by introducing heavyweight barriers, such as sandboxes, process boundaries, firewalls, or service meshes. These comprehensively block many of the avenues for components to interact. Then, since components still do need to communicate, it's common to effectively poke small holes in the barriers, such as allowing specific HTTP connections to pass through.
However, while the barriers-and-pinholes approach fixes the immediate problems, barriers are sometimes too comprehensive. They get in the way of connecting things that we do want to connect. This then leads us to do more things through the pinholes, or poke more pinholes, which then increases the risks of unexpected interactions again.
There are many reasons why we get stuck on this path, but one is that heavyweight barriers tend to focus us on limiting the mechanisms that let components interact, such as which components a component can directly talk to, and what kinds of messages it can send. However, the more fundamental problems are often in the relationships between components. One of the things we can do to avoid these problems, and promote modularity in a sustainable way, is to design component APIs that avoid ghosts.
By “ghost” here, I mean any situation where resources are referenced by plain data.
And by “plain data” here, I mean strings, integers, or any other data where
independently produced copies of the data are interchangeable. For example, two
completely independent parts of a system may create a string with the value
"Purple", and the two strings will be interchangeable.
Plain data can contain filenames, network addresses, usernames, or other forms of data which effectively reference resources.
For example, when we say that a particular string contains a filesystem path, we mean that it refers to an entity in a filesystem namespace. Filesystem namespaces are not explicitly passed as arguments in the APIs of many popular systems, so from the perspective of an API, while paths are explicit string parameters, the additional information referenced by those paths is not. In this post, we'll say this additional information is being carried by a “ghost”:
// We're explicitly passing a path, but implicitly
// passing the namespace to resolve it in.
do_stuff("/tmp/data.txt");
By passing a filename, the caller here is requiring that the callee have a specific filesystem namespace, in order to interpret that filename. This is an example of a relationship between components that's difficult to control with heavyweight barriers focused on mechanisms. The actual message is just a string, which could be communicated through practically any pinhole. And once the caller can send filenames through, it can depend on the callee having a particular namespace and being able to resolve those filenames, and we have the potential to get complex relationships between caller and callee, despite whatever barriers we put between them.
As another example, suppose one part of a system sets an environment variable, and another part of the system reads it.
⬅ In one place:
➡ In another:
char *timeout = getenv("TIMEOUT");
Here, two independent parts of the system both use the string "TIMEOUT" as
an identifier to send a message between them. As far as these specific parts of
the code know, it's as if the content of the message is carried by a ghost,
from one part to the other.
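To make the exchange concrete, here is a minimal C sketch of both sides, assuming a POSIX system. The function names `publish_timeout` and `read_timeout_seconds` are hypothetical, chosen only for illustration. Note that neither signature mentions the other side, yet the reader's result depends entirely on what the writer stored.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sender: publishes a timeout through the process environment.
 * Returns 0 on success, -1 on failure. */
int publish_timeout(const char *seconds) {
    assert(seconds != NULL);
    return setenv("TIMEOUT", seconds, 1 /* overwrite any existing value */);
}

/* Hypothetical receiver: nothing in this signature reveals that the
 * result depends on what some other code stored under "TIMEOUT". */
long read_timeout_seconds(long fallback) {
    const char *timeout = getenv("TIMEOUT");
    if (timeout == NULL) {
        return fallback;
    }
    char *end = NULL;
    long value = strtol(timeout, &end, 10);
    /* Fall back when the variable is empty or not a plain decimal number. */
    if (end == timeout || *end != '\0') {
        return fallback;
    }
    return value;
}
```

The only link between the two functions is the string "TIMEOUT" and the shared environment behind it, which is the ghost.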
IP addresses are another example of plain data that references other resources. If one part of a system listens on a socket and sends the IP address to other parts of the system for them to connect to, the address is a plain-data list of integers, while the interpretation of those integers depends on a particular network view.
Ghosts can also occur within key-value stores, registries, brokers, buses, and many other things where the identifiers are plain data. There are situations where plain-data identifiers are the only option, such as when working with external resources. But when designing component APIs, we should seek to avoid ghosts where we can, and seek to identify and encapsulate ghosts where we can't.
Granted, the way all these things work isn't literally supernatural. We can figure out how filesystem namespaces, environment-variable dictionaries, networks, and other things make our resources available if we know some things about the surrounding system. However, that goes against our goal of modularity. We specifically don't want individual components knowing about the system around them.
The trouble with ghosts
Ghosts are often convenient, in the way that duct tape is convenient. They can quickly connect two things, even in a large system, without extensive changes. And on small scales, they sometimes work well. But like duct tape, they aren't a material one wants to build complex structures from.
Ghosts have four distinct problems as systems scale up in complexity:
- Ghosts don't always go to the places we want them to 👻➡😞. When we pass plain-data references around, they depend on the ghosts going to the same places. If our references go somewhere that the ghosts don't go, attempting to resolve them may fail, or may resolve to something unintended. An example of this is CWE-706 “Use of Incorrectly-Resolved Name or Reference”.
- Ghosts may go places we don't want them to 👻➡😲. For example, environment variable values are propagated to all child processes, even those that don't need them, and some programs log the contents of their environment for diagnostic purposes. If our variables contain sensitive information, it may get exposed. Similarly, ghosts may also persist for longer than we want them to, because cleaning them up can lead to dangling or even aliasing references. Examples of this include CWE-532 “Insertion of Sensitive Information into Log File” and CWE-386 “Symbolic Name not Mapping to Correct Object”.
- Ghosts may collide with other ghosts 👻➡💥⬅👻. In a complex system, the same name can end up getting used in multiple places. Naming conventions can help, but aren't enough if there are multiple instances of the same component within the larger system. In the case of filesystem namespaces, sometimes two different parts of a system need different versions of a resource, but they both expect it to be at the same path. An example of this is CWE-435 “Improper Interaction Between Multiple Correctly-Behaving Entities”.
- And sometimes, ghosts come from places they're not expected to ❓➡👻. When plain data can reference resources, any plain data within a system could potentially be representing a reference. Plain data may also be influenced by attackers. Examples of this are CWE-22 “Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')” and CWE-73 “External Control of File Name or Path”.
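As a small illustration of the last point, here is a C sketch of how a name that arrives as plain data can point somewhere unexpected. The helper `naive_join` and the directory `/srv/uploads` are hypothetical, invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: joins a base directory and a name received as
 * plain data. Returns 0 on success, -1 if the result does not fit. */
int naive_join(char *out, size_t n, const char *base, const char *name) {
    assert(out != NULL && base != NULL && name != NULL);
    int len = snprintf(out, n, "%s/%s", base, name);
    /* snprintf reports the length it wanted; reject truncated results. */
    return (len >= 0 && (size_t)len < n) ? 0 : -1;
}
```

Calling `naive_join(buf, sizeof buf, "/srv/uploads", "../../etc/passwd")` succeeds and yields a string that still starts with `/srv/uploads`, but once the filesystem resolves the `..` components it refers to a file outside that directory. The string itself gives the callee no way to tell which resource the sender was entitled to name.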
Systems which use ghost patterns often face several challenges:
Ghosts complicate static analysis
Being unable to know where ghosts are going and where they're coming from makes it difficult and often impossible to answer questions such as:
- “I have sensitive data flowing through part of the system. Where are all the places that might be able to access it?”
- “If I change the behavior of something, what are all the things I need to update?”
- “If there's a bug in something, what parts of the system could be affected?”
- “If an attacker can control certain input data, what are all the things which they might be able to influence?”
Ghosts complicate debugging
A common way to debug complex systems is to isolate parts of the system and study how they behave independently. Ghosts create situations where components work differently when run independently than when they're run together, or work differently in different environments, making this kind of debugging more difficult.
Ghosts create hidden cause-and-effect relationships, making it harder to understand the system's behavior.
Ghosts are often a sign of over-sharing
Over-sharing happens when a component is given access to resources that it doesn't need. This often happens in namespace-oriented systems because it's difficult to configure namespaces precisely enough to share only what each component needs. Even with features such as bind mounts on Linux, it can be tricky to make sure that every part of a complex system has access to all the things it needs, at the paths it expects them to be at, and nothing it doesn't need. As a result, programs are often run with more filesystem access than they strictly need.
This makes it difficult to follow the Principle of Least Authority (PoLA).
Ghosts can contribute to confused deputies
A common pattern in complex systems composed of multiple privilege levels is that some components are considered to run on behalf of specific users, which determine their privilege level. We can call components that work this way deputies of the users that own them.
When components send plain-data requests to components running as different users, senders may be able to reference resources they shouldn't be able to access. In such situations, receivers perform access control, explicitly checking requests to see whether the sender has the appropriate privileges. This is often tricky, especially when an API has a complex surface area. Receivers may get confused into doing things they shouldn't allow senders to ask them to do.
This is a form of the confused deputy problem.
A different kind of relationship
Systems which have a concept of handles—values which can be passed between components, but which are not plain data—can use them to avoid ghosts. Handles provide a way to make specific resources accessible across a component boundary without requiring any other relationship.
Handles make cause-and-effect relationships clear, since they are explicitly passed between components. Receivers can also assume that any handle they are passed represents a resource that the sender is allowed to ask them to operate on. That way, receivers need less authority of their own, which reduces the risk of them accidentally misusing their authority.
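As a rough sketch of what this can look like in C on a POSIX system, the hypothetical function below receives an already-open file descriptor and never sees a name. File descriptors are only an approximation of handles, since within a process they are still integers, but they show the shape of the relationship.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical receiver: counts the bytes readable from a descriptor it
 * was handed. It needs no path, no namespace, and no authority of its own.
 * Returns the byte count, or -1 on a read error. */
long count_bytes(int fd) {
    assert(fd >= 0);
    char buf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        total += (long)n;
    }
    return n < 0 ? -1 : total;
}
```

Because the argument is a descriptor, the same function works on a file, a pipe, or a socket, including resources that have no name in any namespace.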
Ghosts can hide inside explicit sharing
One of the tricky things about ghosts is that they're about relationships rather than specific mechanisms. Mechanisms tend to be easy to understand, and to sandbox. But relationships that permit ghosts can pass through even the most restrictive sandboxes.
This blog post talks a lot about implicitly shared resources, but that's not the only place ghosts can hide. For example, recall the do_stuff("/tmp/data.txt") call above, where caller and callee implicitly share a filesystem namespace and pass strings representing paths.
This is a ghost pattern, with a string carrying a reference to an implicitly shared namespace. A simple way we might try to eliminate such a ghost is to replace the use of an implicit namespace with an explicit filesystem root parameter:
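One way such a change might look in C on a POSIX system is sketched below, using a directory file descriptor as the explicit root and `openat` to resolve the path against it. The name `do_stuff_in_root` is hypothetical, and the body is reduced to opening and closing the file.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical callee: the namespace is now an explicit argument,
 * but the resource inside it is still selected by a string.
 * Returns 0 if the path could be opened, -1 otherwise. */
int do_stuff_in_root(int root_fd, const char *path) {
    assert(path != NULL);
    /* Relative paths are resolved starting from root_fd. Plain openat()
     * does not by itself stop ".." components from leaving it. */
    int fd = openat(root_fd, path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    /* ... work with fd here ... */
    close(fd);
    return 0;
}
```

A caller would write something like `do_stuff_in_root(root_fd, "tmp/data.txt")`, which is the earlier call with one extra argument.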
This might be tempting, as it means that most of our code doesn't need to fundamentally change. It's a mostly mechanical change to just add root parameters in places where they're needed, and everything else about our code can stay the same.
We might then be tempted to claim that we've eliminated our ghosts here, because we now do the sharing via explicit communication rather than an implicitly shared resource. And we might indeed find that this code does afford us some added flexibility.
The problem is that this doesn't change the relationship. We're still using strings to identify specific resources within the filesystem root we're passing around. And that means we still have resources being referenced by plain data.
There's effectively a ghost, hiding inside the resource.
One way to think about it is in terms of granularity. While passing around handles to “root”, “world”, “namespace” or “registry” resources is better than implicit sharing, those kinds of resources tend to be coarse-grained. They can end up having ghosts hiding inside them. Plain-data references to specific items within coarse-grained resources can still have dynamic cause-and-effect relationships, and can still dangle, collide, or be influenced by attackers.
To avoid ghosts, it's not enough to change the mechanisms. To change the relationships, we need to switch from coarse-grained sharing to fine-grained sharing with handles. Instead of whole filesystems, we should ideally reference specific directories or even individual files, like this:
// Open the file using our own privileges.
file_handle = open(root_handle, "tmp/file.txt");

// Instead of passing root_handle and a path, pass
// *just* the one file handle to the other component.
process_open_file(file_handle);
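For readers who want something closer to runnable code, the same idea might be sketched in C on a POSIX system as follows, treating file descriptors as handles. The buffer parameters on `process_open_file` and the wrapper `open_and_process` are assumptions added for illustration, not part of the pseudocode above.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* The other component: it sees one open file and nothing else.
 * Reads up to n bytes into buf; returns the count, or -1 on error. */
ssize_t process_open_file(int file_fd, char *buf, size_t n) {
    assert(buf != NULL);
    return read(file_fd, buf, n);
}

/* Hypothetical caller: resolves the name using its own privileges,
 * then passes *just* the file descriptor across the boundary. */
ssize_t open_and_process(int root_fd, const char *path, char *buf, size_t n) {
    int file_fd = openat(root_fd, path, O_RDONLY);
    if (file_fd < 0) {
        return -1;
    }
    ssize_t got = process_open_file(file_fd, buf, n);
    close(file_fd);
    return got;
}
```

Name resolution happens in exactly one place, in the component that holds the root, so `process_open_file` cannot be steered to any other file by the data it receives.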
How to smell a ghost
There are some common signs that a ghost may be present.
String parameters which don't represent user data. String types in programming languages can hold many different kinds of things, such as names or text fields. And when a program is talking to the outside world, strings may also contain external identifiers such as filenames, network addresses, or URLs. But when software is talking to other software, resources should ideally be identified by handles, rather than by string identifiers. And as a bonus, this also helps minimize exposure to Unicode subtleties and quoting subtleties.
Heuristic: “Strings are for humans” 🌟
The word “the”. Whenever we find ourselves thinking about the filesystem, the network, the process, the host, the OS, or the computer, it often means we're making assumptions about state that might be shared between parts of a larger system. Wherever possible, components should not be aware of “the host”, or any entities associated with it, as nouns.
Heuristic: “Components should be hostless” 🌟
User identity outside the user interface. While there's a place for user-facing software to maintain explicit knowledge of who it's acting on behalf of, components interfacing with other components should eagerly resolve that user authority to obtain finer-grained handles which can then be passed to other components. That way, those other components don't need the full access of the user, and will be less likely to make assumptions about shared state associated with the user.
Heuristic: “Handles are permissions” 🌟
Wrapping it up
An important property for complex software systems is that they be modular, where parts can be isolated from the whole. Ghosts, or resources referenced by plain data, create implicit relationships which must be considered when we add, remove, change, or understand individual components. They impede modularity, making complex systems less manageable.
This leads to a design principle for components in complex software systems: