

Errors and Pipelines, a diversion

Consider the following Unix shell command:

$ a | b | c > d

This runs program a, pipes the output to program b, pipes that output to program c, and finally redirects that output to file d.

If we think of the process tree, we could draw a diagram for it like this:

             a   b   c
             ↑   ↑   ↑
             └─shell─┘

The shell spawns processes for a, b, and c, wires up their file descriptors to connect their I/O in the desired way, and then waits for them to complete.

From another perspective, this is a kind of call graph, in which the shell "calls" a, b, and c. They're on separate stacks and have separate threads of control, so they can coexist, but the shell waits for them to complete and collects their "return values".

Anyway. A very common question from newcomers to Unix is, "what's the difference between | and >?" It's easy, we tell them: | pipes to a command, while > redirects to a file. So easy, we say. But is it really easy?

A wild error appears

First let's consider the normal case. What happens if b encounters an error?

                 😱₀
             a   b   c
             ↑   ↑   ↑
             └─shell─┘

Well, each of these programs has been invoked by the shell, which is waiting for them to exit so that it can collect their exit status. So b will exit with an error code, and that error code goes back to the shell.

By default, Unix shells will silently ignore such errors: a failure in the middle of a pipeline doesn't even affect the pipeline's exit status, and a script keeps right on going past a failing command anyway. Oops. So, OK, let's back up and remember that if we're using Unix shell scripting for anything, we should add set -euo pipefail at the top of all our scripts.
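
As a sketch, the whole script then looks something like this; a, b, c, and d are still just placeholders for your own commands and output file, and since pipefail isn't in every shell, this assumes bash:

#!/usr/bin/env bash
# -e           exit as soon as any command fails
# -u           treat expanding an unset variable as an error
# -o pipefail  make a pipeline fail if any command in it fails,
#              not just the last one
set -euo pipefail

a | b | c > d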

Now the shell will notice the error.

And then the other commands in the pipeline wind down too. Anything upstream of b, like a, will be terminated with SIGPIPE the next time it writes, or, if it ignores SIGPIPE, it will get an EPIPE error from the write instead, which tells it to exit quietly. Anything downstream, like c, simply sees end-of-file on its input and finishes up. So the error propagates to the shell, and a and c quietly exit:

            ❌  😱₀ ❌
            a   b   c
            ↑   ↑   ↑
            │   😱₁ │
            └─shell─┘

Great.
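
If you want to see that mechanism in action, here's a quick experiment to run in bash. head plays the role of a command like b that exits early, yes plays the role of the command writing into it, and PIPESTATUS is bash's array of the exit statuses of every command in the last pipeline:

$ yes | head -n 1; echo "${PIPESTATUS[@]}"

This should print a single y followed by something like 141 0: head exited normally after one line, and yes was terminated by SIGPIPE, which the shell reports as 128 + 13 = 141.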

A wild error appears somewhere else

What happens if d fails?

Or wait, where even is d in that diagram?

Oh oh oh! That's the wrong question! Remember, we used > for d, so it's not a program. It's just a redirection to an output file. So it can't fail!

Or wait, it kind of can? Filesystems can run out of space. Disks can have errors. Things can happen.
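
In fact, it's easy to watch a redirection fail. On Linux there's /dev/full, a device file that rejects every write with ENOSPC, "no space left on device"; that device is the only thing here that isn't a placeholder from the original example:

$ echo hi > /dev/full

The write fails, the shell prints a write error, and the command's exit status is nonzero, just like any other failing command.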

But in that case, why can't these errors be reported in the same way? Oh, that's right. It's because, unlike a's and b's output file descriptors, which have another process on the other side, the thing on the other side of c's output file descriptor is just the operating system.

And operating systems can't fail. Or I should say, they can't report a failure back to the shell, because the shell isn't thinking of the OS as a process it's managing. The operating system can't push the error directly to where it needs to go, so it instead pushes the error backwards through the pipeline into c, and obliges c to report the problem to the shell for it.

          ❌  ❌  😱₁  😱₀
          a   b   c   OS
          ↑   ↑   ↑
          │   │   😱₂
          └─shell─┘

It wasn't c's error. c is just a stream transformer and really shouldn't care about where the output is going or what ultimately happens to it, just like a and b. But c is the unlucky one that happens to be at the end of the pipeline, so it gets the error and the job of ensuring that the error makes it back to the shell.
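
The /dev/full device from earlier makes this easy to watch, too. Here yes, head, and cat stand in for a, b, and c, with head -c 100M just giving the pipeline a bounded amount of data to move, and PIPESTATUS again showing each command's exit status:

$ yes | head -c 100M | cat > /dev/full; echo "${PIPESTATUS[@]}"

cat hits the write error, reports it, and exits with a failure status, and head and yes are then terminated by SIGPIPE, so this should print something like 141 141 1.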

So here's a question. What would be different about this process if we inserted what might seem like a no-op into our command line:

$ a | b | c | cat > d

cat defaults to reading from stdin and writing to stdout, so it's just a no-op passthrough here. But inserting it does subtly change how the error handling works. With a cat in there, instead of the error being reported to c, the error is reported to cat. That means that c can now be simply terminated with SIGPIPE or told to exit quietly with EPIPE in exactly the same way as a and b.

         ❌  ❌  ❌   😱₁  😱₀
         a   b   c   cat  OS
         ↑   ↑   ↑   ↑
         │   │   │   😱₂
         └─shell─┴───┘

Which is kind of nice. And simple. It might look a little busier, but look closely at what this says: a, b, and c are all handled in the same way. c is just a stream transformer, just like a and b. Because why should it be different? Why should it have to know anything about reporting errors that happen elsewhere? It doesn't need to if it's just writing to a pipe, because SIGPIPE and EPIPE handle that automatically, just as they do for the others.
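
Running the earlier experiment with the extra cat appended shows exactly that; the trailing cat is the inserted no-op, and the first cat is still standing in for c:

$ yes | head -c 100M | cat | cat > /dev/full; echo "${PIPESTATUS[@]}"

Now only the trailing cat reports the write error, while yes, head, and the first cat are all terminated by SIGPIPE alike, so this should print something like 141 141 141 1.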

What if we built a system that always did that for programs in pipelines, so that they wouldn't need to worry about doing this extra thing? A system that does for the last process what we already do for all the other processes in a pipeline? That would make it easier to write small programs that do one thing and do it well.

But we wouldn't want to always create an extra cat process. That would slow things down. What we'd really want to do is design a system that just behaves like that without needing an extra process. And perhaps such a system could explore making other things about a pipeline more efficient as well. All those processes and context switches have a cost.

That's it for now. I hope you enjoyed this diversion!