Errors and Pipelines, a diversion
Consider the following Unix shell command:

$ a | b | c > d

This runs program a, pipes the output to program b, pipes that output to program c, and finally redirects that output to file d.
If we think of the process tree, we could draw a diagram for it like this:
   a     b     c
   ↑     ↑     ↑
   └───shell───┘
The shell spawns processes for a, b, and c, wires up their file descriptors to connect their I/O in the desired way, and then waits for them to complete.
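If you want to see that wiring with your own eyes, on Linux you can peek through /proc. This is just an illustrative aside, not part of the pipeline above; the sleep commands are only placeholders that stay alive long enough to inspect.

$ sleep 30 | sleep 30 &
$ ls -l /proc/$!/fd/0    # $! is the pid of the last command in the background pipeline

The second sleep's stdin shows up as a symlink to something like pipe:[…] rather than to your terminal or a regular file.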
From another perspective, this is a kind of call graph, in which the shell "calls" a, b, and c. They're on separate stacks and have separate threads of control, so they can coexist, but the shell waits for them to complete and collects their "return values".
Anyway. A very common question from newcomers to Unix is, "what's the difference between | and >?" It's easy, we tell them. | pipes to a command, while > redirects to a file. So easy, we say. But is it really easy?
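To make that answer concrete, here's a tiny illustration (ls, wc, and files.txt are just example names):

$ ls | wc -l        # | : ls's stdout becomes wc's stdin
$ ls > files.txt    # > : ls's stdout goes into the file files.txt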
A wild error appears
First let's consider the normal case. What happens if b encounters an error?
        😱₀
   a     b     c
   ↑     ↑     ↑
   └───shell───┘
Well, each of these programs has been invoked by the shell, which is waiting for them to exit so that it can collect their error status. So b will exit with an error code, back to the shell.
Unix shells will silently ignore such errors. Oops. So ok, let's back up and remember that if we're using Unix shell scripting for anything, we should add set -euo pipefail at the top of all our scripts. Now the shell will notice the error.
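Here's a minimal sketch of the difference that makes. (Note that pipefail is a bash feature rather than plain POSIX sh, and false and cat are just standing in for a failing and a succeeding command.)

#!/usr/bin/env bash
set -euo pipefail

# false fails in the middle of the pipeline. Without pipefail, the
# pipeline's exit status would be cat's (0) and the script would carry on.
# With it, the pipeline reports the failure and set -e stops the script here.
false | cat
echo "never reached"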
And then, the other commands in the pipeline get wound down too. a, which is writing into the now-broken pipe, will be terminated with SIGPIPE; or if it ignores the SIGPIPE, it will be given an EPIPE error the next time it tries to write output, which tells it to exit quietly. c, which is reading from b, simply sees end-of-file. So the error propagates to the shell, and a and c quietly exit:
   ❌    😱₀   ❌
   a     b     c
   ↑     ↑     ↑
   │    😱₁    │
   └───shell───┘
Great.
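You can watch that cleanup happen from any terminal. A small illustration (PIPESTATUS is a bash-ism, and 141 is 128 + 13, i.e. "killed by SIGPIPE"):

$ yes | head -n 1
y
$ echo "${PIPESTATUS[@]}"
141 0

head exits after one line, and yes, which would otherwise run forever, gets terminated the next time it writes into the now-closed pipe.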
A wild error appears somewhere else
What happens if d fails? Or wait, where even is d in that diagram?

Oh oh oh! That's the wrong question! Remember, we used > for d, so it's not a program. It's just a redirection to an output file. So it can't fail!
Or wait, it kind of can? Filesystems can run out of space. Disks can have errors. Things can happen. But in that case, why can't these errors be reported in the same way?
Oh that's right. It's because unlike the file descriptors for a's and b's outputs, the thing on the other side of c's file descriptor is just the operating system.

And operating systems can't fail. Or I should say, they can't report a failure back to the shell, because the shell isn't thinking of the OS as a process it's managing. The operating system can't push the error directly to where it needs to go, so it instead pushes the error backwards through the pipeline into c, and obliges c to report the problem to the shell for it.
   ❌    ❌    😱₁    😱₀
   a     b     c     OS
   ↑     ↑     ↑
   │     │     😱₂
   └───shell───┘
It wasn't c's error. c is just a stream transformer and really shouldn't care about where the output is going or what ultimately happens to it, just like a and b. But c is the unlucky one that happens to be at the end of the pipeline, so it gets the error and the job of ensuring that the error makes it back to the shell.
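On Linux you can provoke exactly this situation without filling up a real disk, because every write to /dev/full fails with ENOSPC. A small illustration, with cat standing in for the unlucky c (the exact error text depends on your coreutils version):

$ seq 1 5 | cat > /dev/full
cat: write error: No space left on device
$ echo "${PIPESTATUS[@]}"
0 1

seq is perfectly happy; cat, the last program before the operating system, is the one stuck reporting the failure.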
So here's a question. What would be different about this process if we inserted what might seem like a no-op into our command line:
$ a | b | c | cat > d
cat defaults to reading from stdin and writing to stdout, so it's just a no-op passthrough here. But inserting it does subtly change how the error handling works. With a cat in there, instead of the error being reported to c, the error is reported to cat. That means that c can now be simply terminated with SIGPIPE or told to exit quietly with EPIPE in exactly the same way as a and b.
   ❌    ❌    ❌   😱₁   😱₀
   a     b     c    cat   OS
   ↑     ↑     ↑     ↑
   │     │     │    😱₂
   └───shell───┴─────┘
Which is kind of nice. And simple. It might look a little busier, but look closely at what this says. a, b, and c are all handled in the same way. c is just a stream transformer, just like a and b. Because why should it be different? Why should it have to know anything about error reporting for errors that happen elsewhere? It doesn't need that if it's just writing to a pipe, because things like SIGPIPE or EPIPE automatically handle it, just like in the others.
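Bash's PIPESTATUS shows the shift concretely. A small illustration, with tr standing in for c and /dev/full again playing the part of a full disk (exact error text may vary):

$ seq 1 5 | tr -d '\n' > /dev/full
tr: write error: No space left on device
$ seq 1 5 | tr -d '\n' | cat > /dev/full
cat: write error: No space left on device
$ echo "${PIPESTATUS[@]}"
0 0 1

With the extra cat in place, tr exits cleanly just like a and b, and only the process actually touching the file has to report the error.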
What if we built a system that always did that for programs in pipelines, so that they don't need to worry about doing this extra thing? A system that does for the last program what we already do for all the other processes in a pipeline? That would make it easier to write small programs that do one thing and do it well.
But we wouldn't want to always create an extra cat process. That would slow things down. What we'd really want to do is design a system that just behaves like that without needing an extra process. And perhaps such a system could explore making other things about a pipeline more efficient as well. All those processes and context switching have a cost.
That's it for now. I hope you enjoyed this diversion!