A Unix Shell in Ruby - Part 4: Pipes

Published on March 18, 2012 by Jesse Storimer

 

Previously, in this series, we looked at interacting with environment variables (like PATH). In this article we'll investigate a peculiarity. It will reveal an impostor and eventually lead to implementing shell pipelines.

First, the Peculiarity

Observe these interactions with the shirt shell:

$ ./shirt 
$ ls
LICENSE README shirt
$ ls | grep README
README
$ ls > output.txt
$ cat output.txt && echo 'win'
LICENSE
README
output.txt
shirt
win

These are typical shell interactions, except that shirt hasn't implemented any of these features yet. There's no code in the project to handle pipes, output redirection, conditional execution, etc. These things are basically syntactic sugar that the shell provides on top of the system's capabilities, so it's up to shirt to implement them. So where does this peculiarity come from?

If you look at the shirt source it's not too hard to figure out. There are only two code paths that a command can take at the moment, either it's a builtin command or it gets sent to exec. None of the commands we used in the example session above were builtins so the strange behaviour must be coming from exec.

Before I unearth the mystery I want to make clear the expected behaviour of exec and then I'll do the big reveal.

The Base Case

In part 1 I explained the semantics of exec(2). It transforms the current process into a new process based on the command given as input.

That makes perfect sense when you do something like exec('ls'). The current process becomes an instance of ls. But what about something like exec('ls | grep README')? Based on my description this method invocation doesn't really make sense. There are actually two programs specified in that command joined by a pipe. The current process can't become two other processes, so there must be something else at play here.

Before we look at the source for a Ruby VM I want to show what is happening with the processes visually. This is output from a tool called pstree(1), which prints a textual representation of the process hierarchy, clearly showing the children of a given process.[1]

The following shirt session:

$ ./shirt
$ ls
LICENSE README shirt

yields this from pstree(1):

-+= 24853 jessestorimer ruby ./shirt
 \--- 31837 jessestorimer ls

This is exactly what we expected. We have a shirt process with one child, an ls process. This comes about because we created a child process with fork, then used exec to transform that to an ls process. So far so good.
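In Ruby that pattern looks something like this (a minimal sketch, not shirt's exact code):

  pid = fork {
    exec('ls')        # this child process is transformed into ls
  }
  Process.wait(pid)   # the shell waits for the child to exit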

An Impostor in the Lineage

The following shirt session:

$ ./shirt
$ ls | grep README
README

yields this from pstree(1):

-+= 57816 jessestorimer ruby ./shirt
 \-+- 58082 jessestorimer sh -c ls | grep README
   |--- 58083 jessestorimer ls
   \--- 58084 jessestorimer grep README

That one looks different. The first child of shirt is sh! sh, often referred to as /bin/sh, is the Bourne shell, predecessor to Bash (a.k.a. the Bourne-Again Shell). It's the inspiration for the syntax and functionality of most modern shells and is available on almost any Unix system. But let's go back and understand the pstree(1) output.

So the first child of shirt is sh, and sh in turn has two children: ls and grep. That's how all of the commands I showed above were working: exec was actually handing the input off to a different shell. This is certainly against the spirit of our project; we want shirt to implement everything that would otherwise be handled by a shell like sh.

The Big Reveal

Let's look at the Rubinius source code for exec and see where it decides to use a subshell:

 1     # kernel/common/process.rb
 2 
 3     def exec(cmd, *args)
 4       if args.empty? and cmd.kind_of? String
 5         raise Errno::ENOENT if cmd.empty?
 6         if /([*?{}\[\]<>()~&|$;'`"\n\s]|[^\w-])/o.match(cmd)
 7           Process.perform_exec "/bin/sh", ["sh", "-c", cmd]
 8         else
 9           Process.perform_exec cmd, [cmd]
10         end
11       else
12         if cmd.kind_of? Array
13           prog = cmd[0]
14           name = cmd[1]
15         else
16           name = prog = cmd
17         end
18 
19         argv = [name]
20         args.each do |arg|
21           argv << arg.to_s
22         end
23 
24         Process.perform_exec prog, argv
25       end
26     end
27     module_function :exec

The most relevant lines for our purposes are #4 and #6. Line #4 enters the first branch when exec is given a single String and no additional arguments. Line #6 has a gnarly regex which, if matched, will cause exec to actually spawn an instance of /bin/sh and pass the string along to it. How can we prevent this from happening in shirt?

The simplest fix is to never pass a line of user input into exec as a single string. You can see that, in that case, we'd enter the outer else block, where there's no chance of shelling out to a different shell. The one exception is a 'plain' string like exec('ls'): it won't match the gnarly regex, so it hits line #9, which execs the command directly without a subshell.

So, the first argument given to exec should be the name of the program to be executed, while the rest of the arguments will be treated as a list of the command-line arguments to that program. In Ruby we know this as ARGV.
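To make the distinction concrete, here's a sketch of the two forms (the file name is just for illustration):

  # One string containing special characters: Ruby hands the whole
  # thing to /bin/sh, which interprets the pipe.
  exec('ls | grep README')

  # Program name and arguments passed separately: no subshell,
  # and the arguments become the ARGV of grep.
  exec('grep', 'README', 'notes.txt')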

Here's the simple change we can make to shirt so that it won't end up invoking the Bourne shell:

  ...
  pid = fork {
    exec command, *arguments
  }
  ...

Previously we had passed the whole line, a raw string of input from the user, straight into exec here.

Aftermath

This is great, now we can be sure that shirt isn't leaning on /bin/sh for features. The downside is that the stuff we were doing at the beginning doesn't work any more. Let's try a pipeline:

$ ./shirt
$ ls | grep README
ls: grep: No such file or directory
ls: |: No such file or directory
README

That's some pretty strange output. In this case ls is being sent to exec as the command to execute and the rest of the line is being treated as the argument list for ls. Since ls expects its arguments to be names of files or directories, it's telling us that its first two arguments, grep and |, don't exist on the filesystem. Notice that it successfully printed the last entry? That's because the last argument, README, names a file that actually exists in the current directory, so ls lists it.

So shirt needs pipes. Time to start plumbing.

Shell Pipelines Demystified

I think shell pipelines are misunderstood. For the longest time I had no idea how they worked. More than once I read someone casually toss out something like:

Yeah, the shell just uses the pipe(2) system call to hook up the stdout from one process to the stdin of the other.

Oh?

I had some idea what stdin and stdout were, but I didn't know anything about a pipe(2) system call. It was certainly harder to understand than the simplicity of using pipes in my shell.

The reason I'm telling you this is because I want you to understand that there's no mystery behind pipes in the shell. Going forward I'm going to refer to pipes in the shell as 'shell pipelines' or just 'pipelines'. When I say 'pipe' I'm going to be referring to the programmatic concept of the pipe, which is used to implement shell pipelines but can also be used for a myriad of other things.

The Shell Pipeline

First things first, we know we're going to need a 'pipe'. What does that look like? Let's ask irb.

>> IO.pipe
=> [#<IO:fd 5>, #<IO:fd 6>]

In Ruby, IO.pipe will create a pipe using the pipe(2) system call. It returns an array of two IO objects. These IO objects are a bit like anonymous files: they can be read from and written to like files, but they have no location on the filesystem.

Something else about the pipe: although IO.pipe returns two IO objects, the pipe itself allows only a uni-directional stream of data. The first IO object in that array is read-only and the second is write-only, so data can travel only one way along the pipe.
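A quick irb session shows the one-way flow in action:

>> reader, writer = IO.pipe
=> [#<IO:fd 5>, #<IO:fd 6>]
>> writer.puts "into the pipe"
=> nil
>> reader.gets
=> "into the pipe\n"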

So, step one in creating a shell pipeline is to create the pipe. Picture a shirt process that has just called IO.pipe: it holds both the read end and the write end.

So our shell now has a pipe. Let's say, for example, that we're going to be creating the following pipeline: ls | grep README. There are two processes there so shirt must fork two child processes.

Remember that fork creates an exact copy of the calling process, so each child process gets its own copies of the IO objects referencing the pipe that the parent created.
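Here's a minimal sketch of that inheritance in action: the child writes to the pipe using the copies it received from fork.

  reader, writer = IO.pipe

  fork {
    # the child inherited its own copies of both ends of the pipe
    writer.puts 'hello from the child'
  }

  writer.close        # the parent closes its copy of the write end
  puts reader.gets    # => hello from the child
  Process.wait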

Eventually the left child will become ls and the right will become grep. But first we need to set up the pipe.

We're going to tell the left child (the one that will become ls) that it should redirect its $stdout to the write-only IO object. In this way anything that ls writes to its $stdout will fill up the pipe and wait there for the grep process to read it.

Similarly we'll tell the right child (the one that will become grep) that it should read its $stdin from the read-only IO object. This will be the receiving end of whatever the ls process writes to the pipe.

There's a bit of housecleaning to do still but we'll look at that when we step through the code.

Now the first child process does its exec and becomes ls. The second does the same and becomes grep.
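To give you a visual in code form, here's the whole sequence as a minimal standalone sketch for ls | grep README (a simplification, not shirt's actual implementation):

  reader, writer = IO.pipe

  fork {                       # this child becomes ls
    $stdout.reopen(writer)     # ls will write into the pipe
    writer.close
    reader.close
    exec('ls')
  }

  fork {                       # this child becomes grep
    $stdin.reopen(reader)      # grep will read from the pipe
    reader.close
    writer.close
    exec('grep', 'README')
  }

  reader.close                 # the parent closes its copies so
  writer.close                 # grep can see end-of-file
  Process.waitall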

That's a rough overview of how a shell pipeline works from the view of processes. Now let's dive into the implementation in shirt.

The Ending, Then the Beginning

I'm going to start by pasting the code shirt has after implementing pipelines. Then we'll step through the changes line by line.

  ...
    commands = split_on_pipes(line)

    placeholder_in = $stdin
    placeholder_out = $stdout
    pipe = []

    commands.each_with_index do |command, index|
      program, *arguments = Shellwords.shellsplit(command)

      if builtin?(program)
        call_builtin(program, *arguments)

      else
        if index+1 < commands.size
          pipe = IO.pipe
          placeholder_out = pipe.last
        else
          placeholder_out = $stdout
        end

        spawn_program(program, *arguments, placeholder_out, placeholder_in)

        placeholder_out.close unless placeholder_out == $stdout
        placeholder_in.close unless placeholder_in == $stdin
        placeholder_in = pipe.first
      end
    end

    Process.waitall
  end

The first notable change is this line, after we get a line of input from $stdin:

commands = split_on_pipes(line)

...and here's the implementation:

def split_on_pipes(line)
  line.scan( /([^"'|]+)|["']([^"']+)["']/ ).flatten.compact
end

This method takes a line of input and splits it using pipes as a delimiter. The regex is a bit hard to grok, but the gist is this: it matches either a run of characters that are neither a quote nor a pipe, or a quote followed by some non-quote characters and a closing quote. Full credit goes to http://stackoverflow.com/a/4970136/1124616. It returns an array of command strings.
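For our example pipeline it yields:

>> split_on_pipes("ls | grep README")
=> ["ls ", " grep README"]

The leftover whitespace is harmless; Shellwords.shellsplit strips it in the next step.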

The reason we dance around the quotes and go to that trouble is that it should be possible to include a pipe character inside a quoted string, to pass as an argument to a program. If we were just to use a naive line.split('|') then we'd end up splitting on pipe characters that were passed inside quotes as program arguments.

    placeholder_in = $stdin
    placeholder_out = $stdout
    pipe = []

This is simply some boilerplate setup. We'll talk about these variables further into the source.

    commands.each_with_index do |command, index|
      program, *arguments = Shellwords.shellsplit(command)

      if builtin?(program)
        call_builtin(program, *arguments)

Now that we can have multiple commands running at one time we need to process each one individually, hence the loop.

There's something notable here about the handling of builtins: it hasn't changed. Even though we'll be supporting pipes in the shell the builtin commands will not be able to use pipes in the same way that subprocesses do. I hope the reason why becomes clear as we look at the next two blocks of code, but I'll be sure to remind you when the time comes.

      if index+1 < commands.size
        pipe = IO.pipe
        placeholder_out = pipe.last
      else
        placeholder_out = $stdout
      end

This is some setup before we actually spawn a new process for the current command. This particular piece of code is responsible for setting up the future $stdout of the current command.

The else block takes care of the case where the current command is the last one in the pipeline. In that case its $stdout should be set to $stdout of the shell. In this way the last command will print its output back to the terminal.

The if block takes care of the other commands in the pipeline. It creates a new pipe and assigns the future $stdout of the current command to the write-only IO object that IO.pipe returned.

Then we pass the program, arguments, and the placeholder stdin and stdout to the spawn_program method.

        spawn_program(program, *arguments, placeholder_out, placeholder_in)

Here's the implementation:

def spawn_program(program, *arguments, placeholder_out, placeholder_in)
  fork {
    unless placeholder_out == $stdout
      $stdout.reopen(placeholder_out)
      placeholder_out.close
    end

    unless placeholder_in == $stdin
      $stdin.reopen(placeholder_in)
      placeholder_in.close
    end

    exec program, *arguments
  }
end

The fork + exec parts of this code are the same as before, but there's new code too. There's a block that reopens $stdout to point at the stdout placeholder we assigned earlier; this would be the write-only end of a pipe. If there's no placeholder stdout (i.e. this is the last command in a pipeline) then we don't modify $stdout.

We use IO#reopen to change $stdout to point to a different IO object. We could have instead passed an instance of File to have $stdout write to a file.
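For instance, something like this would send everything the process prints to a file instead of the terminal (the file name here is just for illustration):

  $stdout.reopen(File.open('output.txt', 'w'))
  puts 'this line lands in output.txt'

That's essentially what output redirection (ls > output.txt) will boil down to when we implement it.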

We give the same treatment to $stdin below that.

One very important point about this code: the placeholder IO objects are closed after being passed to #reopen. Why? A quick review of what IO#reopen does: $stdout.reopen(placeholder_out) reassociates $stdout with the file descriptor pointed at by placeholder_out, so there are now two references to the same stream.

Once we do the exec and ls finishes writing, it will close its $stdout. That close signals the process listening on the read-only end of the pipe that there is no more data coming, so it can stop listening. But if we hadn't closed placeholder_out, the process on the read-only end could never stop listening, because there would still be an open reference to the write-only end that might yet be written to. So, always close references to IO objects you're not going to use.

In the end the program gets passed to exec and the modified $stdout and $stdin are shared with the executed program.

This is a good time to bring up builtins again. The reason that builtins can't participate in pipelines in this way is that they don't have their own $stdin and $stdout to redirect; they share these with shirt since they're called in the context of the shell itself. We'll have to hook them up in a different way, and we'll save that for a later article.

Lastly:

        placeholder_out.close unless placeholder_out == $stdout
        placeholder_in.close unless placeholder_in == $stdin
        placeholder_in = pipe.first
      end
    end

    Process.waitall

The first two lines here are related precisely to what I explained in the last section. The parent process also holds references to the IO objects from the pipes. In order for the child processes to communicate properly the parent process needs to close its references to those IO objects.

The next line assigns the future $stdin for the next command to the read-only IO object from the pipe created for the current command. Putting this assignment at the end of the loop ensures that every command in the pipeline gets its $stdin reassociated except the first one, which makes perfect sense: the first command has no preceding command to take input from.

The last line tells the current process to wait for all child processes to exit. Since we're now spawning multiple processes at a time, we have to wait for all of them to finish before returning control to the terminal. Currently shirt doesn't inspect the return value from that call to check for failures, but it'll need to in the future.
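For reference, Process.waitall returns an array of [pid, status] pairs, so a future check might look something like this sketch:

  Process.waitall.each do |pid, status|
    $stderr.puts "process #{pid} failed" unless status.success?
  end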

Whew!

If you're still reading then I hope you learned a thing or two and now understand how simple shell pipelines are. The more I look at the implementation of a Unix shell the more I realize how many system primitives were originally created for the purpose of implementing a shell.

Next time around we'll implement another feature that was being handled by a subshell: IO redirection.


If you enjoy learning about this stuff then check out Working With Unix Processes because it's all about Unix programming for the Rubyist.

The source for this article is available on GitHub.

[1] I got this data by first installing pstree through Homebrew. Then I started up a shirt instance with ./shirt, used ps(1) to get the pid of the shirt process, and fed that into pstree as pstree <pid> to get the output you see.



