Tag: barrier synchronization

  • What I learned from writing synchronization barriers

    Before starting project 2 (for my advanced operating systems course), I took a snapshot of my understanding of synchronization barriers. In retrospect, I’m glad I took 10 minutes out of my day to jot down what I did (and did not) know, because I now have a clearer picture of what I learned. Overall, the project was worthwhile: I gained some theoretical knowledge of computer science and also got to flex my C development skills, writing about 500 lines of code.

    Discovered a subtle race condition in the lecture’s pseudocode

    Just by looking at the diagram below, it’s not obvious that there’s a subtle race condition hidden. I was only able to identify it after whipping up some code (below) and analyzing the concurrent flows. I elaborate a little more on the race condition, which results in a deadlock, in this blog post.

    Centralized Barrier

     

    [code lang="C"]

    /*
     * Race condition possible here. Say 2 threads enter, thread A and
     * thread B. Thread A scheduled first and is about to enter the while
     * (count > 0) loop. But just before then, thread B enters (count == 0)
     * and sets count = 2. At which point, we have a deadlock, thread A
     * cannot break free out of the barrier
     */

    if (count == 0) {
        count = NUM_THREADS;
    } else {
        while (count > 0) {
            printf("Spinning …. count = %d\n", count);
        }
        while (count != NUM_THREADS) {
            printf("Spinning on count\n");
        }
    }

    [/code]

    Data Structures and Algorithms

    How to represent a tree based algorithm using multi-dimensional arrays in C

    For both the dissemination and tournament barriers, I had to build multi-dimensional arrays in C. I initially had a difficult time envisioning the data structure described in the research papers, asking myself questions such as “what do I use to index into the first dimension?”. My first intuition was that, for the tournament barrier, I’d index into the outer array using the round ID, but in fact you index into it using the rank (or thread ID), and the inner array stores that thread’s role for each round.

    [code lang="C"]
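    #include <stdbool.h>
    #include <stddef.h>

    /* MAX_NUM_THREADS, MAX_ROUNDS and PARITY_BIT are assumed to be defined elsewhere. */
    typedef struct {
        bool myflags[PARITY_BIT][MAX_ROUNDS];
        bool *partnerflags[PARITY_BIT][MAX_ROUNDS];
    } flags_t;

    /* Clear every flag; the array holds one flags_t per thread. */
    void flags_init(flags_t flags[MAX_NUM_THREADS])
    {
        int i, j, k;

        for (i = 0; i < MAX_NUM_THREADS; i++) {
            for (j = 0; j < PARITY_BIT; j++) {
                /* The inner dimension is rounds, not threads. */
                for (k = 0; k < MAX_ROUNDS; k++) {
                    flags[i].myflags[j][k] = false;
                    flags[i].partnerflags[j][k] = NULL;
                }
            }
        }
    }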
    [/code]
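
    For the dissemination barrier, the other half of the setup is pointing each thread’s partnerflags at the thread it signals in a given round. Below is a minimal sketch of how that wiring might look, assuming the usual partner rule (i + 2^k) mod n. The helper name flags_wire_partners and the macro values are my own placeholders for illustration, not the project’s actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed sizes for illustration only. */
#define MAX_NUM_THREADS 8
#define MAX_ROUNDS      3   /* ceil(log2(MAX_NUM_THREADS)) */
#define PARITY_BIT      2   /* two parities so back-to-back barriers do not collide */

typedef struct {
    bool myflags[PARITY_BIT][MAX_ROUNDS];
    bool *partnerflags[PARITY_BIT][MAX_ROUNDS];
} flags_t;

/* Hypothetical helper: thread i signals thread (i + 2^k) mod n in round k. */
void flags_wire_partners(flags_t flags[], int num_threads, int num_rounds)
{
    for (int i = 0; i < num_threads; i++) {
        for (int r = 0; r < PARITY_BIT; r++) {
            for (int k = 0; k < num_rounds; k++) {
                int partner = (i + (1 << k)) % num_threads;
                flags[i].myflags[r][k] = false;
                /* Writing through this pointer sets the partner's own flag. */
                flags[i].partnerflags[r][k] = &flags[partner].myflags[r][k];
            }
        }
    }
}
```

    With this layout, the first index is always the thread’s rank, and signaling a partner in round k is just *flags[i].partnerflags[parity][k] = true.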

    OpenMP and OpenMPI

    Prior to starting, I had never heard of either OpenMP or OpenMPI. Overall, they are two impressive pieces of software that make multi-threading (OpenMP) and message passing (OpenMPI) way easier than dealing directly with the Linux pthreads library.

    Summary

    Overall, the project was rewarding, and my understanding of synchronization barriers (and their various flavors) was strengthened by hands-on development. And if I ever need to write concurrent software for a professional project, I’ll definitely consider using OpenMP and OpenMPI instead of low-level libraries like pthreads.

  • Barrier Synchronization (Part 2/2)

    Part 1 of barrier synchronization covers my notes on the first couple of types of synchronization barriers, including the naive centralized barrier and the slightly more advanced tree barrier. This post is a continuation and covers the three other barriers: the MCS barrier, the tournament barrier, and the dissemination barrier.

    Summary

    In the MCS tree barrier, there are two separate data structures that must be maintained. The first data structure (a 4-ary tree, in which each node has at most four children) handles the arrival of the processes, and the second data structure (a binary tree) handles the signaling and waking up of all other processes. In a nutshell, each parent node holds pointers to its children’s structures, allowing the parent process to wake up its children once all processes have arrived.

    The tournament barrier constructs a tree too, and at each level pairs of processes compete against one another. These competitions, however, are fixed: the algorithm predetermines which process will advance to the next round. The winners percolate up the tree, and at the topmost level the final winner signals and wakes up the loser. This waking up of the losers then happens at each lower level until all nodes are woken up.

    The dissemination barrier reminds me of a gossip protocol. With this algorithm, all nodes detect convergence (i.e. all processes have arrived) once every process has heard, directly or indirectly, from all other processes (this is the key takeaway); a process receives one (and only one) message per round. The communication complexity of this algorithm is O(n log n): there are ceil(log2 n) rounds, and during each round n messages are sent, one from each node to its ordained neighbor.

    The algorithms described thus far share a common requirement: they all require sense reversal.

    MCS Tree Barrier (Binary Wakeup)

    MCS Tree barrier with its “has child” vector

    Summary

    Okay, I think I understand what’s going on. There are two separate data structures that need to be maintained for the MCS tree barrier. The first data structure handles the arrival (this is the 4-ary tree) and the second (binary tree) handles the signaling and waking up of all the other processes. The reason why the latter works so well is that by design, we know the position of each of the nodes and each parent contains a pointer to their children, allowing them to easily signal the wake up.
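
    Because node positions are known by design, both trees can live in one flat array and the links reduce to index arithmetic. Here is a rough sketch of that arithmetic, assuming nodes are numbered 0..n-1 with node 0 as the root of both trees. The function names are hypothetical and only meant to show the shape of the two structures.

```c
#include <stdbool.h>

#define ARRIVAL_FANIN 4   /* the arrival tree is 4-ary */
#define WAKEUP_FANOUT 2   /* the wakeup tree is binary */

/* Arrival tree: parent of node i, or -1 for the root. */
int arrival_parent(int i)
{
    return i == 0 ? -1 : (i - 1) / ARRIVAL_FANIN;
}

/* Which of the parent's four slots node i reports into (valid for i > 0). */
int arrival_slot(int i)
{
    return (i - 1) % ARRIVAL_FANIN;
}

/* Entry j of node i's "has child" vector, for a barrier of n processes. */
bool has_child(int i, int j, int n)
{
    return ARRIVAL_FANIN * i + j + 1 < n;
}

/* Wakeup tree: child j (0 or 1) of node i, or -1 if that child does not exist. */
int wakeup_child(int i, int j, int n)
{
    int c = WAKEUP_FANOUT * i + j + 1;
    return c < n ? c : -1;
}
```

    For example, with n = 6 the root waits on nodes 1 through 4 during arrival, node 5 reports into slot 0 of node 1, and on wakeup the root signals nodes 1 and 2.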

    Tournament Barrier

    Tournament Barrier – fixed competitions. Winner holds the responsibility to wake up the losers

    Summary

    Construct a tree whose lowest level holds all the nodes (i.e. processors). The processors compete with one another in pairs, although each round is fixed, in the sense that the winner is predetermined. The spin location is statically determined at every level.

    Tournament Barrier (Continued)

    Summary

    Two important aspects: arrival moves up the tree with match fixing. Then each winner is responsible for waking up the “losers”, traversing back down. Curious, what sort of data structure? I can see an array or a tree …
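
    One way to answer the data structure question is a plain array indexed by rank and then by round, where each entry is the thread’s fixed role. The sketch below computes those roles on the fly, assuming the role rules as I understand them from the Mellor-Crummey and Scott formulation. The enum and function names are my own, and rounds are numbered 0 through ceil(log2 n).

```c
typedef enum { UNUSED, DROPOUT, WINNER, LOSER, BYE, CHAMPION } role_t;

/* Role of thread `rank` in round `k` of a tournament among `n` threads. */
role_t tournament_role(int rank, int k, int n)
{
    if (k == 0)
        return DROPOUT;      /* round 0 is a placeholder: nobody plays */

    int span = 1 << k;       /* 2^k: size of the bracket decided in round k */
    int half = span >> 1;    /* 2^(k-1): distance to the opponent */

    if (rank % span == half)
        return LOSER;        /* signals its winner, then spins waiting for wakeup */
    if (rank % span != 0)
        return UNUSED;       /* already lost in an earlier round */
    if (rank == 0 && span >= n)
        return CHAMPION;     /* wins the final match and starts the wakeup */
    if (rank + half >= n)
        return BYE;          /* no opponent exists in this round */
    return WINNER;
}
```

    Filling a table such as roles[rank][k] = tournament_role(rank, k, n) once at init time means every thread knows statically where it spins and whom it wakes.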

    Tournament Barrier (Continued)

    Summary

    Lots of similarity with sense reversing tree algorithm

    Dissemination Barrier

    Dissemination Barrier – gossip like protocol

    Summary

    Ordered communication: like a well-orchestrated gossip protocol. Each process sends a message to its ordained peer during each “round”. But I’m curious, do we need multiple rounds?

    Dissemination Barrier (continued)

    Summary

    Gossip in each round differs in the sense that the ordained neighbor changes with the round number k: P_i sends to P_((i + 2^k) mod n). Will probably need to read up on the paper to get a better understanding of the point of the rounds ..

    Quiz: Barrier Completion

    Summary

    Key point here that I just figured out is this: every processor needs to hear from every other processor. So it takes ceil(log2 N) rounds; the ceiling is there because N need not be a power of 2 (still not sure what that means exactly)
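
    To convince myself of the round count, here is a small single-threaded simulation (a sketch, not barrier code). It assumes each message carries everything the sender has heard so far and tracks that knowledge as a bitmask. The function names are made up for this example, and it is limited to at most 32 processes.

```c
#include <stdint.h>

/* ceil(log2(n)) for n >= 1, computed with integer shifts. */
int dissemination_rounds(int n)
{
    int rounds = 0;
    while ((1 << rounds) < n)
        rounds++;
    return rounds;
}

/*
 * Simulate `rounds` rounds for n <= 32 processes. knows[i] is a bitmask of the
 * processes whose arrival process i has heard about, directly or indirectly.
 * Returns 1 if every process has heard from every other process.
 */
int all_heard_after(int n, int rounds)
{
    uint32_t knows[32], next[32];
    uint32_t everyone = (n == 32) ? UINT32_MAX : ((UINT32_C(1) << n) - 1);

    for (int i = 0; i < n; i++)
        knows[i] = UINT32_C(1) << i;           /* each process knows only itself */

    for (int k = 0; k < rounds; k++) {
        for (int i = 0; i < n; i++)
            next[i] = knows[i];
        for (int i = 0; i < n; i++) {
            int partner = (i + (1 << k)) % n;  /* P_i signals P_((i + 2^k) mod n) */
            next[partner] |= knows[i];         /* the message carries all that i knows */
        }
        for (int i = 0; i < n; i++)
            knows[i] = next[i];
    }

    for (int i = 0; i < n; i++)
        if (knows[i] != everyone)
            return 0;
    return 1;
}
```

    With N = 5, two rounds leave every process missing one peer, while three rounds (the ceiling of log2 5) are enough. That is the case the ceiling covers.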

    Dissemination Barrier (continued)

    Summary

    All barriers need sense reversal, and the dissemination barrier is no exception. This barrier technique works for non-cache-coherent (NCC) machines and clusters. Every round has N messages, and there are ceil(log2 N) rounds, so the total communication complexity is O(N log N) (where N is the number of processors), while the number of rounds is O(log N). N messages must be sent every round, no exception.

    Performance Evaluation

    Summary

    Most important question to ask when choosing and evaluating performance is: what is the trend? Not exact numbers, but trends.

  • Barrier Synchronization (Part 1/2)

    As mentioned previously, there are different types of synchronization primitives that we operating system designers offer. If, as an application designer, you need to ensure only one thread can access a piece of shared memory at a time, use a mutual exclusion synchronization primitive. But what about a different scenario, in which you need all threads to reach a certain point in the code, and only once all threads reach that point do they continue? That’s where barrier synchronization comes into play.

    This post covers two types of barrier synchronization. The first is the naive, centralized barrier and the second is the tree barrier.

    In a centralized barrier, we basically have a global count variable, and as each thread enters the barrier, it decrements the shared count variable. After decrementing the count, a thread hits a predicate and branches: if the count is not zero, the thread enters a busy spin loop, spinning while the count is greater than zero. However, if the count equals zero after the decrement, then all threads have arrived at the barrier.

    Simple enough, right? Yes it is, but the devil is in the details because there’s a subtle bug, a subtle edge case. It is entirely possible (based off of the code snippet below) that when the last thread enters the barrier and decrements the count, all the other threads immediately move beyond the barrier (since the count is no longer greater than zero). In other words, the other threads can race ahead, possibly into the next barrier, before the last thread gets to reset the count back to N, the number of threads.

    How to avoid this problem? Simple: add another while loop that guarantees that the threads do not leave the barrier until the counter gets reset. Very elegant. Very simple.

    One way to optimize the centralized barrier is to introduce a sense reversing barrier (as I described in “making sense of the sense reversing barrier”).

    The next type of barrier is a tree barrier. The tree barrier groups multiple processes together at multiple levels (the number of levels is log n, where n is the number of processors), each group maintaining its own count and lock sense variables. The benefit? Each group spins on its own lock sense. Downside? The spin location is dynamic, not static, and can impede performance on NUMA architectures.

    Centralized Barrier

    Summary

    Centralized barrier synchronization is pretty simple: keep a counter that decrements as each thread reaches the barrier. Every thread/process spins until the last thread arrives, at which point the last thread resets the barrier counter so that the barrier can be reused later on.

    Problems with Algorithm

    Summary

    Race condition: while the last thread is updating the counter, all the other threads move forward

    Counting Barrier

    Summary

    Such a simple and elegant solution by adding a second spin loop (still inefficient, but neat nonetheless). Sense reverse barrier algorithm

    Sense Reversing Barrier

    Summary

    One way to optimize the centralized barrier is to introduce a sense reversing barrier. Essentially, each process maintains its own local “sense” that flips from 0 to 1 (or 1 to 0) each time it reaches a synchronization barrier. This local variable is compared against a shared flag, and only when the two are equal can the threads/processes proceed past the current barrier and move on to the next.
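
    Here is a minimal sketch of a sense reversing barrier using C11 atomics. This is my own illustration rather than the lecture’s pseudocode, and the type and function names (sr_barrier_t, sr_barrier_wait) are assumptions. The important detail is that the last arrival resets the count before it flips the shared flag, so nobody can re-enter the barrier while the count is stale.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;   /* threads still expected at the current barrier */
    atomic_bool sense;   /* shared flag, flipped once per barrier episode */
    int         n;       /* number of participating threads */
} sr_barrier_t;

void sr_barrier_init(sr_barrier_t *b, int n)
{
    atomic_init(&b->count, n);
    atomic_init(&b->sense, false);
    b->n = n;
}

/* Each thread passes a pointer to its own local sense, initially false. */
void sr_barrier_wait(sr_barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;   /* the value that means "released" this time */

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the count first, then release everyone. */
        atomic_store(&b->count, b->n);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                       /* spin until the shared flag matches */
    }
}
```

    Each thread keeps a bool local_sense = false on its own stack and calls sr_barrier_wait(&b, &local_sense) at every barrier; there is only one spin loop and no second phase.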

    Tree Barrier

    Summary

    Group processes (or threads) and each group has its own shared variables (count and lock sense). Before flipping the lock sense, the final process needs to move “up to the next level” and check if all other processors have arrived at the next level. Things are getting a little more spicy and complicated with this type of barrier

    Tree Barrier (Continued)

    Summary

    With a tree barrier, a process arrives at its group (of count and lock sense), and will decrement the count variable and will then check the lock sense variable. If lock sense is not equal, then spin. If last

    Tree Barrier (continued)

    Summary

    Once the last process reaches the root, it’s their responsibility to begin waking up the lower levels, traversing back down the tree. At each level, they will be flipping the lock sense flag
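
    The arrival and wakeup phases fit naturally into one recursive function: going up is the recursion, and waking lower levels happens as the calls return. Below is a rough sketch under my own assumptions (C11 atomics, hypothetical names tree_node_t and tree_arrive, one count and lock sense per group), not the exact algorithm from the lecture.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct tree_node {
    atomic_int        count;      /* arrivals still expected in this group */
    atomic_bool       locksense;  /* flag this group's members spin on */
    int               k;          /* group size (fan-in) */
    struct tree_node *parent;     /* NULL at the root */
} tree_node_t;

void tree_node_init(tree_node_t *nd, int k, tree_node_t *parent)
{
    atomic_init(&nd->count, k);
    atomic_init(&nd->locksense, false);
    nd->k = k;
    nd->parent = parent;
}

/* Arrive at group `nd`; `sense` is the caller's freshly flipped local sense. */
void tree_arrive(tree_node_t *nd, bool sense)
{
    if (atomic_fetch_sub(&nd->count, 1) == 1) {
        if (nd->parent != NULL)
            tree_arrive(nd->parent, sense);   /* last in the group: move up a level */
        atomic_store(&nd->count, nd->k);      /* reset for the next episode */
        atomic_store(&nd->locksense, sense);  /* wake this group on the way back down */
    } else {
        while (atomic_load(&nd->locksense) != sense)
            ;                                 /* spin on this group's own flag */
    }
}
```

    Which thread ends up climbing depends on arrival order, which is exactly why the spin location is not known ahead of time.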

    Tree Barrier (Continued)

    Summary

    As always, there’s a trade-off or hidden downside with this implementation. First, the spin location is not statically determined. This dynamic allocation may be problematic, especially on NUMA (non-uniform memory access) architectures, because a process may be spinning on a remote memory location. But my question is, are there any systems that do not offer coherence?