Nobody understands the GIL - Part 2: Implementation

Published on June 14, 2013 by Jesse Storimer

Last time, I set out to take you on a deep dive into MRI to see how the GIL is implemented. But first, I wanted to make sure I was asking the right question. Part 1 formulated the question; today we'll look for answers inside MRI itself. We'll go looking for that elusive creature they call the GIL.

In the first draft of this post, I really emphasized the C code that underlies the GIL, showing as much of it as I could. But after a while, the message was drowning in the details. I went back and reworked things so that you now see less C code with more explanations and diagrams. But for the source code divers, I'll at least mention the C functions, so you can go track them down.

Last time on...

Part 1 left off asking these two questions:

  1. Does the GIL guarantee that array << nil is atomic?
  2. Does the GIL guarantee that your Ruby code will be thread-safe?

The first question is answered by looking at the implementation, so let's start there.

This snippet was spinning us around last time:

array = []

5.times.map do
  Thread.new do
    1000.times do
      array << nil
    end
  end
end.each(&:join)

puts array.size

If you assume that Array is thread-safe, the expected result would be an Array with 5000 elements. Since Array isn't thread-safe, implementations like JRuby and Rubinius produce an unexpected result: something less than 5000. It's the interleaving of, and switching between, multiple threads that corrupts the underlying data.

MRI produces the expected result, but is it a fluke or a guarantee? Let's begin our technical deep dive with a snippet of Ruby code to study.

Thread.new do
  array << nil
end

From the top 

To study what happens with this snippet, we need to look at how a thread is spawned inside MRI. We'll mostly be looking at functions from the thread*.c files in MRI. There's lots of indirection in these files to support both the Windows and Posix thread APIs, but all of the functions we look at come from those source files.

The first underlying operation of Thread.new is to spawn a new native thread to back the Ruby thread. The C function that becomes the body of the new thread is called thread_start_func_2. Here's a look at this function at a high level.

 

There's a lot of boilerplate code here that you don't need to see. I highlighted the parts of the function that we care about. Near the top, this new thread acquires the GIL. Remember that this thread remains idle until it actually owns the GIL. Near the middle of the function, it calls the block that you pass to Thread.new. After wrapping things up, it releases the GIL and exits the native thread.

In our snippet, this new thread is spawned from the main thread. Given this, we can assume that the main thread is currently holding the GIL. This new thread will have to wait until the main thread releases it before it can continue.

Let's look at how that happens when this new thread tries to acquire the GIL.

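static void
gvl_acquire_common(rb_vm_t *vm)
{
  /* Someone else holds the GIL: register as a waiter. */
  if (vm->gvl.acquired) {
    vm->gvl.waiting++;
    /* The first waiter wakes the timer thread. */
    if (vm->gvl.waiting == 1) {
      rb_thread_wakeup_timer_thread_low();
    }

    /* Sleep until the GIL is released. */
    while (vm->gvl.acquired) {
      native_cond_wait(&vm->gvl.cond, &vm->gvl.lock);
    }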

This is a snippet from the gvl_acquire_common function. This function is called when our new thread tries to acquire the GIL.

First, it checks if the GIL is currently acquired. If it is, then it increments the waiting attribute of the GIL. With our snippet, this value should now be 1. The very next line checks to see if waiting is 1. It is, so the next line triggers a wakeup of a timer thread.
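The logic in this excerpt can be modelled in plain Ruby. What follows is only a rough sketch, not MRI's code: ToyGVL and all of its method names are made up for illustration, a Ruby Mutex and ConditionVariable stand in for the native lock and condition variable, and a simple counter stands in for waking the timer thread. Decrementing the waiting count after the wait loop is my assumption about what happens past the end of the excerpt.

```ruby
# Toy model of the acquire logic in the excerpt above. Hypothetical names
# throughout; this is not how you would touch the real GIL from Ruby.
class ToyGVL
  attr_reader :timer_wakeups

  def initialize
    @lock = Mutex.new               # stands in for vm->gvl.lock
    @cond = ConditionVariable.new   # stands in for vm->gvl.cond
    @acquired = false
    @waiting = 0
    @timer_wakeups = 0              # counts simulated timer-thread wakeups
  end

  def waiting
    @lock.synchronize { @waiting }
  end

  def acquire
    @lock.synchronize do
      if @acquired
        @waiting += 1
        # The first waiter is the one that wakes the timer thread.
        @timer_wakeups += 1 if @waiting == 1
        # Sleep until the holder releases; re-check after every wakeup.
        @cond.wait(@lock) while @acquired
        @waiting -= 1               # assumption: not shown in the excerpt
      end
      @acquired = true
    end
  end

  def release
    @lock.synchronize do
      @acquired = false
      @cond.signal
    end
  end
end

gvl = ToyGVL.new
order = []

gvl.acquire                         # the "main thread" holds the lock
spawned = Thread.new do
  gvl.acquire                       # blocks until the main thread releases
  order << :spawned
  gvl.release
end

sleep 0.01 until gvl.waiting == 1   # wait for the new thread to queue up
order << :main
gvl.release
spawned.join

puts order.inspect
puts gvl.timer_wakeups
```

The spawned thread can't record anything until the main thread lets go, so this prints [:main, :spawned] followed by 1: one waiter showed up, and it triggered exactly one simulated wakeup.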

This timer thread is the secret sauce that keeps MRI's threading system humming along, and keeps any one thread from hogging the GIL. But before we jump too far ahead, let's illustrate the state of things with the GIL, then introduce this timer thread.

I've said a few times that an MRI thread is backed by a native OS thread. This is true, but this diagram suggests that each MRI thread is running in parallel on its native thread. The GIL prevents this. We need to draw in the GIL to make this more realistic.

 

When a Ruby thread wants to execute code in its native thread, it must first acquire the GIL. The GIL mediates access between a Ruby thread and its underlying native thread, severely reducing parallelism! In the previous diagram, the Ruby threads could execute code in their native threads in parallel. This second diagram is closer to reality for MRI: only one thread can hold the GIL at any given time, so parallel execution of code is completely disabled.

 

For the MRI core team, the GIL protects the internal state of the system. With a GIL, they don't require any locks or synchronization around the internal data structures. If two threads can't be mutating the internals at the same time, then no race conditions can occur.

For you, the developer, this will severely limit the parallelism you get from running your Ruby code on MRI.

The timer thread

I said that the timer thread is what keeps one thread from hogging the GIL. The timer thread is just a native thread that exists internally in MRI; it has no equivalent Ruby thread. MRI starts the timer thread at boot with the rb_thread_create_timer_thread function.

When MRI boots up and only the main thread is running, the timer thread sleeps. But remember, once one thread is waiting on the GIL, it wakes up the timer thread.

This is closer to how the GIL is implemented in MRI. Harking back to the snippet that kicked this off, the thread on the far right is the one we just spawned. Since it's the only thread waiting for the GIL, it wakes up the timer thread.

This timer thread is what keeps a thread from hogging the GIL. Every 100ms, the timer thread sets an interrupt flag on the thread currently holding the GIL using the RUBY_VM_SET_TIMER_INTERRUPT macro. It's important to note the details here because this will give us the clue as to whether or not array << nil is atomic.

If you're familiar with the concept of timeslices, this is very similar.

Every 100ms the timer thread sets an interrupt flag on the thread currently holding the GIL. Setting an interrupt flag doesn't actually interrupt the execution of the thread. If that were the case, we could be certain that array << nil was not atomic.
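To make the "flag, not interruption" idea concrete, here's a small Ruby sketch. It is a loose analogy rather than MRI's mechanism: the variable names are hypothetical, the interval is shortened from 100ms so the demo finishes quickly, and Thread.pass stands in for handing control to another thread.

```ruby
# Sketch of the timer idea: a background thread only sets a flag, and the
# worker notices it at points of its own choosing. All names are hypothetical.
interval = 0.01          # seconds; shortened from the 100ms mentioned above
interrupt_flag = false
flag_lock = Mutex.new

timer = Thread.new do
  loop do
    sleep interval
    flag_lock.synchronize { interrupt_flag = true }   # no thread is stopped here
  end
end

yields = 0
units_of_work = 0

until yields >= 3
  units_of_work += 1     # stands in for one complete method call

  # Safe point: read and clear the flag in one step.
  pending = flag_lock.synchronize do
    was_set = interrupt_flag
    interrupt_flag = false
    was_set
  end

  if pending
    yields += 1
    Thread.pass          # give the scheduler a chance to run another thread
  end
end

timer.kill
timer.join

puts "yielded #{yields} times across #{units_of_work} units of work"
```

The worker typically gets through many units of work between yields. The timer never stops it mid-unit; it only leaves a note that the worker reads when it reaches its next safe point.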

Handling the interrupt flag

Deep in a file called vm_eval.c is the code for handling Ruby method calls. It's responsible for setting up the context for a method call and calling the right function. At the end of a function called vm_call0_body, right before it returns the return value of the current method, these interrupts are checked.

If this thread's interrupt flag has been set, then it stops execution on the spot, before returning its value. Before executing any more Ruby code, the current thread will release the GIL and call sched_yield. sched_yield is a system function prompting the thread scheduler to schedule another thread. Once this has been done, this interrupted thread attempts to re-acquire the GIL, now having to wait for another thread to release it.
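The ordering described here is the important part, so here's a single-threaded Ruby sketch of it. ToyThread and its methods are hypothetical names of mine, not MRI functions; the sketch only models "run the method body to completion, then look at the flag".

```ruby
# Hypothetical model of "check interrupts just before the method returns".
class ToyThread
  attr_reader :log

  def initialize
    @interrupt = false
    @log = []
  end

  # What the timer does: set a flag, nothing more.
  def set_timer_interrupt
    @interrupt = true
  end

  # Runs the body of a "C method" to completion, then checks the flag.
  def call_method(name)
    @log << "#{name}: start"
    result = yield
    @log << "#{name}: finish"
    check_interrupts
    result
  end

  private

  def check_interrupts
    return unless @interrupt
    @interrupt = false
    @log << "release lock, yield, re-acquire"
    Thread.pass
  end
end

array = []
thread = ToyThread.new

thread.call_method("Array#<<") do
  thread.set_timer_interrupt   # pretend the timer fires mid-method
  array << nil
end

puts thread.log
```

Even though the flag is set in the middle of the call, the log shows the method starting and finishing before the lock is given up.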

Oh hey, this answers our question. array << nil is atomic. Thanks to the GIL, all Ruby methods implemented in C are atomic.

So this example:

array = []

5.times.map do
  Thread.new do
    1000.times do
      array << nil
    end
  end
end.each(&:join)

puts array.size

is guaranteed to produce the expected result every time when run on MRI.

But keep in mind that this guarantee isn't spelled out in the Ruby code. If you take this code to another implementation with no GIL, it will produce an unexpected result. It's good to know what the GIL guarantees, but it's not a great idea to write code that depends on it. In doing so, you basically put yourself in a vendor lock-in situation with MRI.

Similarly, the GIL is not a public API. It has no documentation and no specification. There's Ruby code out there that implicitly depends on the GIL, but the MRI team has talked before about getting rid of the GIL or changing its semantics. For these reasons, you certainly don't want to be writing code that depends on the current behaviour of the GIL.
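If you want the same result on every implementation, one option is to spell the synchronization out yourself. Here's a minimal sketch of the earlier snippet using Mutex from Ruby's core library; the variable name lock is my own choice.

```ruby
array = []
lock = Mutex.new

5.times.map do
  Thread.new do
    1000.times do
      # Only one thread at a time may run the block, GIL or no GIL.
      lock.synchronize { array << nil }
    end
  end
end.each(&:join)

puts array.size
```

This prints 5000 because the mutex, rather than an implementation detail of MRI, serializes the appends.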

Non-native methods

All I've said so far is that array << nil is atomic. This is an easy one because you have the Array#<< method taking a parameter that is a constant. There's only one method invocation in this expression and it's implemented in C. If it's interrupted during its course, it will simply continue until finished, then release the GIL.

What about something like this?

array << User.find(1)

Before the Array#<< method can execute, it must evaluate the expression on the right hand side so it can pass its value as a parameter. So User.find(1) has to be invoked first. And as you know, User.find(1) will call lots of other Ruby code inside its implementation.

So, methods implemented with Ruby code have no atomicity guarantees on MRI. Only methods implemented with native C code have this guarantee.

So, does that mean Array#<< is still atomic in the above example? Yes, but only once the right hand side has been evaluated. In other words, the User.find(1) method will be invoked with no atomicity guarantee. Then its return value will be passed to Array#<<, which retains its atomicity guarantee.
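It can help to write the expression out as the two steps it really is. In this sketch FakeUser is a hypothetical stand-in for a model class written in Ruby; it isn't from any library.

```ruby
# FakeUser is a hypothetical stand-in for a model class written in Ruby.
class FakeUser
  def self.find(id)
    # Plain Ruby code: a thread switch can happen anywhere in here.
    { id: id }
  end
end

array = []

# What `array << FakeUser.find(1)` amounts to, spelled out in two steps:
user = FakeUser.find(1)   # step 1: Ruby code, no atomicity guarantee
array << user             # step 2: a single C-implemented call

# The one-line form does the same two steps in the same order.
one_liner = []
one_liner << FakeUser.find(1)

puts array == one_liner
```

Any guarantee applies to step 2 only. Whatever happens inside step 1 is ordinary Ruby code and can be interleaved with other threads.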

Update: @headius wrote an excellent comment that expands on what guarantees the GIL provides. If you've read this far, consider it required reading!

What does it all mean?

The GIL makes method invocations atomic. What does this mean for you?

In part 1, I showed what could happen to an example C function when a context switch happened right in the middle of it. With a GIL, that situation is no longer possible. Rather, if that context switch were to happen, the other thread would remain idle waiting for the GIL, giving the current thread a chance to continue uninterrupted. This behaviour is only applicable when MRI implements the Ruby method in C.

This behaviour eliminates a host of race conditions that would otherwise happen inside MRI and need to be guarded against. From this perspective, the GIL is strictly an internal implementation detail of MRI. It keeps MRI safe.

But there's still a lingering question that wasn't answered. Does the GIL provide any guarantee that your Ruby code will be thread-safe?

This is an important question for anyone using MRI, and if you're familiar with multi-threaded programming in other environments, you probably already know that the answer is a resounding no. But this article is long enough already, so I address this question more thoroughly in part 3.

