Write your own thread-safe data structures

 

So last time I promised I would show you how to go about writing your own thread-safe data structure. But a good first question might be: do you really have to? Isn't there something built-in that you can use?

The short answer is: yes. But there's a 'but'. Unfortunately, Ruby (as a language) doesn't strongly embrace the multi-threading world. There is exactly one thread-safe data structure that ships with core Ruby. Besides that, there are some primitives, and other libraries that come with stdlib, that give you more building blocks for building your own thread-safe data structures.
 
In many cases, you can use the built-in Queue class. This is the thread-safe data structure that ships with Ruby. This is a pure Ruby class that exposes a FIFO queue with push/pop operations. This may work for you in some cases, but it may not work in other cases. If you're new to multi-threaded programming, I suggest you continue reading. But if you want a challenge afterwards, come back and follow that link to the implementation of Queue. It's relatively short and shows a few more of the available primitives in action. If it's hard to follow, don't worry about it, there's a few concepts in there I still haven't covered.
 
OK, onward to the sample code.
 
If we need a thread-safe data structure, that's because multiple threads will be attempting to modify it concurrently. At first glance, you might raise the (excellent) point that there's no reason to share a data structure globally between threads. You could just put your data in a SQL database, or Redis, and let them worry about keeping things consistent.
 
But there is a global data structure that I'm sure you use daily: a logger. What is a logger but a global data structure you can put messages into? Of course, a logger stores stuff on disk, rather than in memory, but it's subject to many of the same thread-safety concerns.
 
I'll show you how to approach writing a thread-safe logger. The first version will just store stuff in memory, but thanks to duck-typing, it can be easily swapped for file storage. Here's the first implementation, devoid of any thread-safety concerns:
 
    class Lager
      INFO = 1
      DEBUG = 0
 
      def initialize(log_level = INFO)
        @backend = Array.new
        @log_level = log_level
      end
 
      def info(message)
        log(message, INFO)
      end
 
      def debug(message)
        log(message, DEBUG)
      end
 
      def log(message, severity)
        if severity >= @log_level
          @backend << message
        end
      end
    end
 
Don't get distracted by the boilerplate stuff about log levels, I just included that so it would look and feel like the Logger you're used to.
 
Here's a basic idea of how this class might be used:
 
    logger = Lager.new
 
    10.threads do
      result = important_work
      logger.info "My result was #{result}"
    end
 
So we create one Lager object, yet 10 different threads will be using it concurrently. In a less contrived example, you can imagine this as the global Rails.logger for a Rails app where each request is being handled by a different thread, but writing to the same logger.
 
If you've been keeping up with my other emails, then you should recognize right away that the Lager#log method is not thread-safe. It's modifying an Array without any kind of synchronization. If there's a context switch to another thread before the current thread finishes that operation, the underlying data could become inconsistent.
 
It's very unlikely, but not impossible, that this will cause issues on MRI (because of the GIL), but this will almost certainly show issues on the GIL-less implementations.
 
Solution? Synchronize access to the critical bit of code using a Mutex. 
 
    class Lager
      def initialize(log_level = DEBUG)
        @backend = Array.new
        @log_level = log_level
        @mutex = Mutex.new
      end
 
      ...
 
      def log(message, severity)
        if severity >= log_level
          @mutex.synchronize do
            @backend << message
          end
        end
      end
    end
 
I left out the methods that didn't change. Each Lager instance needs one Mutex, and each time it writes to its @backend it needs to use that Mutex to synchronize access to that portion of the code.
 
Note: the fun thing about doing it this way is that if you change the @backend from an Array.new to a File.open('myapp.log'), the log method continues to work, providing the same thread-safety guarantees.
 
Another way to think about this is that you'll restrict the concurrency ability of this portion of the code. By wrapping it in a mutex, you're effectively saying that while one thread is inside this portion of the code, all other threads must wait to enter the same portion of code.
 
You can think of this like a bottleneck. The threads are free to do their work concurrently with no synchronization between them. Then when they get to this portion of the code, one thread must literally wait in line behind another thread executing that bit of code.
 
This is the reason why it's preferrable to use finer-grained mutexes where possible. The more code that a mutex wraps, the bigger the bottleneck. I've got more about how to organize synchronization, where to put it, and the cost of using it, in my upcoming book. Some of you have been asking for a release date. I can't make any promises, but I'm really trying to make it available in the next month. Keep reading my emails to stay in the loop.

So the key to writing your own thread-safe data structures is to protect access to critical code. With core Ruby, the most suitable mechanism we have for that is Mutex. I'll leave you with a link to Ruby's own Logger implementation, the one that comes with the standard library. Although it packs more features than our simple example, it uses the exact same mechanism to synchronize access to the portion of the code that writes to disk.
 
Until next time,
Jesse

comments powered by Disqus