According to the omscentral reviews for the advanced operating systems course, the midterm exams are nearly identical to the previous semester’s exams, and former students strongly suggest rote memorization as the primary study method. In my opinion, these types of tests do not really serve as a great litmus test for evaluating a student’s understanding. Nonetheless, I’ll prepare for the exam by going over the material in three passes using spaced repetition1 and active recall.
Pass One – Guess
Step through each question from the Spring 2020 midterm, attempting to answer it myself without peeking at the answers. This helps me gauge my understanding and honestly evaluate any gaps I need to fill.
Pass Two – Comparison against answer key
After attempting to answer the questions on my own in pass one, I will compare my answers against the solution guide and reflect on what I got right and what I got wrong.
Pass Three – Spaced repetition
Anki screenshot – question 1a from the midterm (advanced operating systems, Spring 2020)
After the first two passes, I’ll copy the questions (and their solutions) into my digital flash cards (i.e. Anki) and then start ramping up using spaced repetition.
References
Studies show that spaced repetition and testing help knowledge acquisition: https://artofmemory.com/wiki/Spaced_Repetition_and_Recall
If you are an online master’s of computer science (OMSCS) student at Georgia Tech and enrolled in the advanced operating systems (AOS) course, you might want to check out the notes I’ve taken for the lectures by clicking on the advanced operating systems category on my blog. For each video lecture, I’ve written down a summary and the key takeaways. These notes may give you a quick overview of the topics or help you decide which sections to revisit or skip.
At the time of this writing, the write-ups start from the refresher course (e.g. memory systems) and run all the way through the end of parallel systems (i.e. the last lectures included as part of the midterm).
The key takeaway for scheduling is that, as OS designers, we want to follow this mantra: “keep the caches warm”. Following this principle will ensure that the scheduler performs well.
There are many different scheduling algorithms, including first come first serve (FCFS, which focuses on fairness rather than affinity), fixed processor scheduling (a thread runs on the same processor every time), last processor scheduling (a processor prefers the thread that last ran on it, defaulting to any thread), and minimum intervening (which tracks how many other threads ran in between, focusing on cache affinity). One modification is minimum intervening with a queue: threads sitting in a processor’s queue will pollute the cache too, so the policy accounts for the queued threads as well as the intervening ones.
One major point about the different scheduling algorithms is that there is no single correct answer: we need to look at trends and at the overhead. The latter option (i.e. minimum intervening with queues) seems like the best solution, but what about the hidden costs, like keeping track of the internal data structures?
Regardless of which of the above policies is chosen, the OS designer must exercise caution during implementation. Although implementing the policy with a single global queue may work for a small system with a handful of processors or threads, imagine a system with dozens of processors and hundreds of threads: what then? Perhaps build affinity-based local queues instead, one per processor, each with its own policy (nice for flexibility too).
Finally, for performance, think about how to avoid polluting the cache. Ensure that the combined memory footprint of the threads scheduled together (whether frugal threads, hungry threads, or a mix of the two) does not exceed the L2 cache. Determining which threads are frugal and which are hungry requires the operating system to profile the processes, which adds overhead, so we’ll want to minimize profiling.
Scheduling First Principles
Summary
The mantra is always the same: “keep the caches warm”. That being said, how do we (as OS designers), when designing our schedulers, choose which thread or process should run next?
Quiz: Scheduler
Summary
How should the scheduler choose the next thread? All of the answers are suitable. The remainder of the lecture focuses on “the thread whose memory contents are in the CPU cache”. So… what sort of algorithm can we come up with to determine whether a processor’s cache still holds a thread’s contents? I can imagine a few naive solutions.
Memory Hierarchy Refresher
Memory Hierarchy Refresher – L1 cache costs 1-2 cycles, L2 cache about 10 cycles, and memory about 100 cycles
Summary
Going from cache to memory is a heavy price to pay, roughly two orders of magnitude. L1 cache takes 1-2 cycles, L2 around 10 cycles, and memory around 100 cycles.
Cache affinity scheduling
Summary
Ideally a thread gets rescheduled on the same processor it ran on before. However, this might not be possible due to intervening threads having polluted the cache.
Scheduling Policies
Summary
Different types of scheduling policies include first come first serve (focuses on fairness, not affinity), fixed processor scheduling (every thread runs on the same processor every time), last processor scheduling (a processor picks the thread that it previously ran, falling back to picking any thread), and minimum intervening (the most sophisticated, but it requires state tracking).
Minimum Intervening Policy
Summary
For a given thread (say Ti) and a given processor, the affinity is the number of other threads that ran on that processor in between Ti’s executions.
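To make that concrete, here is a tiny sketch (my own, not the lecture’s code) of how a scheduler might pick the processor with the fewest intervening threads for Ti; the bookkeeping array is hypothetical.

```c
/* Sketch: choose the processor with the fewest intervening threads for Ti.
 * The intervening[] bookkeeping is hypothetical, not from the lecture. */
#include <stdio.h>

#define NUM_PROCS 4

/* intervening[p] = number of other threads that ran on processor p
 * since Ti last ran there */
static int pick_processor(const int intervening[NUM_PROCS]) {
    int best = 0;
    for (int p = 1; p < NUM_PROCS; p++)
        if (intervening[p] < intervening[best])
            best = p;
    return best;
}

int main(void) {
    int intervening[NUM_PROCS] = {5, 0, 3, 7};
    printf("schedule Ti on processor %d\n", pick_processor(intervening));
    return 0;
}
```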
Minimum Intervening Plus Queue Policy
Summary
Minimum intervening plus queue – pick the processor with the smallest sum of intervening threads and threads already waiting in its queue
An attribute of a good OS is to make a decision quickly and get out of the way, which is why we might want to employ the minimum intervening scheduler (with limits). Separately, we want our scheduler to take the queued threads into account, not just the affinity, since it’s entirely possible for a thread’s affinity on a CPU to look good while that CPU’s queue contains other threads that will pollute the cache. So minimum intervening plus queue takes both the affinity and the queued threads into account.
Summarizing Scheduling Policies
Summary
Scheduling policies can be categorized into processor centric (which thread should a particular processor choose to maximize the chance that its cache contents will still be relevant) and thread centric (what is the best decision for a particular thread with respect to its execution). Thread centric: fixed processor and last processor scheduling. Processor centric: minimum intervening and minimum intervening plus queue.
Quiz: Scheduling Policy
Summary
With minimum intervening plus queue, we select the processor that minimizes the number of intervening threads plus the number of threads already waiting in its queue.
Implementation Issues
Implementation Issues – instead of having a single global queue, one queue per CPU
Summary
One way to implement scheduling is to use a single global queue. But this may be problematic for systems with lots of processors. Another approach is to have one queue per processor, where each queue can have its own policy (e.g. first come first served, fixed processor, minimum intervening). Within a queue, a thread’s position is determined by its base priority (assigned when the thread first launched), its age (how long the thread has been around), and its affinity.
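Here is a hypothetical sketch of how a per-processor queue might rank its threads. The lecture only names the three attributes, so the equal weighting below is my own assumption, not its formula.

```c
/* Hypothetical ranking of threads within a per-CPU queue. */
#include <stdio.h>

struct thread_attrs {
    int base_priority;   /* priority assigned when the thread launched */
    int age;             /* how long the thread has been around */
    int affinity;        /* cache affinity for this processor */
};

/* Assumed combination: simply add the three attributes together. */
static int queue_priority(struct thread_attrs t) {
    return t.base_priority + t.age + t.affinity;
}

int main(void) {
    struct thread_attrs a = {10, 3, 5}, b = {10, 1, 9};
    printf("a=%d, b=%d -> run %s first\n",
           queue_priority(a), queue_priority(b),
           queue_priority(a) >= queue_priority(b) ? "a" : "b");
    return 0;
}
```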
Performance
Summary
How do we evaluate the scheduling policies? We can look at it from the system’s point of view: throughput, i.e. the number of threads that get executed per unit of time. From the other viewpoint, user centric, we care about response time (end to end time) and variance (deviation of end to end times). With that in mind, first come first serve would perform poorly due to high variance, given that the policy does not distinguish one thread from another.
Performance continued
Performance of scheduler – throughput (system centric) vs response time + variance (user centric)
Summary
A minimum intervening policy may not be suitable for all workloads. Although the policy may work well for light to medium workloads, it may not perform well when the system is under stress, because the caches will get polluted. In that case, fixed processor scheduling may be more performant. So there is no one size fits all. Also, we may want to introduce delays in scheduling, a technique that also works in synchronization and in file systems (which I will learn later on in the course, apparently).
Cache Affinity and Multicore
Summary
Hardware can switch out threads seamlessly without the operating system’s knowledge. But there’s a partnership between the hardware and the OS. The OS tries to ensure that the working set lives in the L1 or L2 cache; missing in these caches and going to main memory is expensive: again, about two orders of magnitude.
Cache Aware Scheduling
Cache aware scheduling – categorize threads into frugal vs hungry threads. Make sure the combined address space of the scheduled threads does not exceed the size of the L2 cache
Summary
For cache aware scheduling, we categorize threads into two groups: cache hungry and cache frugal. Say we have 16 hardware threads: we want to make sure that, after profiling, we group them so that the combined working set of the hungry and frugal threads is less than the L2 cache. But the OS must be careful with profiling and monitoring; it should not heavily interfere. In short, the overhead needs to be minimal.
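As a toy illustration (my own, with made-up working-set sizes and an assumed 512 KB L2), a scheduler could greedily group threads until the combined footprint would spill out of the L2 cache:

```c
/* Toy cache-aware grouping: keep adding threads while the combined
 * working set still fits in L2. All numbers are made up. */
#include <stdio.h>

#define L2_SIZE_KB 512

int main(void) {
    /* hypothetical profiled working-set sizes (KB): frugal and hungry mix */
    int working_set_kb[] = {32, 300, 16, 128, 64};
    int n = sizeof(working_set_kb) / sizeof(working_set_kb[0]);

    int group_total = 0;
    printf("group:");
    for (int i = 0; i < n; i++) {
        if (group_total + working_set_kb[i] <= L2_SIZE_KB) {
            group_total += working_set_kb[i];
            printf(" t%d", i);
        }
    }
    printf("  (total %d KB of %d KB L2)\n", group_total, L2_SIZE_KB);
    return 0;
}
```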
Remote procedure call (RPC) is a framework offered by operating systems (OS) to develop client/server systems; it promotes good software engineering practices and logical protection domains. But without careful consideration, RPC calls (unlike simple procedure calls) can be cost prohibitive in terms of the overhead incurred when marshaling data from client to server (and back).
Out of the box and with no optimization, an RPC costs four memory copy operations: client to kernel, kernel to server, server to kernel, kernel to client. On the second copy operation, the kernel makes an upcall into the server stub, which unmarshals the data marshalled by the client. As OS designers, we need a way to reduce this overhead.
To this end, we reduce the number of copies by using a shared buffer space that gets set up by the kernel during binding, when the client initializes a connection to the server.
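To get a feel for the idea, here is a rough sketch that emulates the shared argument buffer with POSIX shared memory. The buffer name, the struct layout, and the by-value arguments are all my own stand-ins; in the actual design the kernel maps this buffer into both the client’s and the server’s address space at bind time.

```c
/* Sketch only: the client "marshals" arguments by value straight into a
 * shared buffer instead of handing them to the kernel to copy. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct add_args { int a; int b; };   /* pass-by-value arguments only */

int main(void) {
    int fd = shm_open("/astack_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(struct add_args)) != 0) { perror("ftruncate"); return 1; }

    struct add_args *buf = mmap(NULL, sizeof(*buf),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    buf->a = 2;                      /* no kernel copy: write args directly */
    buf->b = 3;
    printf("args staged in shared buffer: a=%d b=%d\n", buf->a, buf->b);

    munmap(buf, sizeof(*buf));
    close(fd);
    shm_unlink("/astack_demo");
    return 0;
}
```

(On older glibc you may need to link with -lrt.)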
RPC and Client Server Systems
The difference between remote procedure calls (RPC) and simple procedure calls
Summary
We want the protection and want the performance: how do we achieve that?
RPC vs Simple Procedure Call
Summary
An RPC call happens at run time (not compile time), and there’s a ton of overhead. Two traps are involved: a call trap from the client and a return trap from the server. There are also two context switches: one from client to server, and then (when the server is done) one from server back to client.
Kernel Copies Quiz
Summary
For every RPC call, there are four copies: from the client address space into a kernel buffer, from the kernel buffer to the server, from the server back to the kernel, and finally from the kernel back to the client (for the response).
Copying Overhead
The (out of the box) overhead of RPC
Summary
Client/server RPC calls require the kernel to perform four copies per call. The RPC framework needs to emulate the stack: client stack (RPC message) -> kernel -> server -> server stack, and the same thing backwards.
Making RPC Cheap
Making RPC cheap (binding)
Summary
The kernel is involved in setting up communication between client and server. The kernel makes an upcall into the server, checking if the client is bonafide. If validation passes, the kernel creates a PD (procedure descriptor) that contains the following three items: the entry point (probably a pointer, I think), the stack size, and the number of calls (that the server can support simultaneously).
Making RPC Cheap (Binding)
Summary
The key takeaway here is that the kernel performs the one-time operation of setting up the binding, allocating shared buffers (as I had correctly guessed) and authenticating the client. The shared buffers basically contain the arguments on the stack (and presumably I’ll find out soon how data flows back). Separately, I learned a new term: “upcalls”.
Making RPC Cheap (actual calls)
Summary
The (argument) shared buffer can only contain values passed by value, not by reference, since the client and server cannot access each other’s address spaces. The professor also mentioned something about the client thread executing in the address space of the server, an optimization technique, but I’m not really following.
Making RPC Cheap (Actual Calls) continued
Summary
Stack arguments are copied from the “A stack” (i.e. the shared buffer) to the “E stack” (execution stack). I still don’t understand the entire concept of “doctoring” and “redoctoring”: I will need to read the research paper, or at least skim it.
Making RPC Cheap (Actual Calls) Continued
Summary
Okay, the concept is starting to make sense. Instead of the kernel copying data, the new approach is that the kernel steps back and lets the client (in user space) copy data into shared memory (no serialization, since the semantics are well understood between client and server). So now there is no more kernel copying, just two copies: marshal and unmarshal. One copy from client to server, and one from server back to client.
Making RPC Cheap Summary
Making RPC calls cheap (summary)
Summary
Explicit costs with the new approach: 1) the client trap and validating the BO (binding object), 2) switching protection domains from client to server, 3) the return trap to go back into the client address space. But there are also implicit costs, like loss of locality.
RPC on SMP
Summary
We can exploit multiple CPUs by dedicating processors to servers, keeping their caches warm.
RPC on SMP Summary
Summary
The entire gist is this: make RPC cheap so that we can promote good software engineering practices and leverage the protection domains that RPC offers.
Part 1 of barrier synchronization covered my notes on the first couple of types of synchronization barriers, including the naive centralized barrier and the slightly more advanced tree barrier. This post is a continuation and covers three other barriers: the MCS barrier, the tournament barrier, and the dissemination barrier.
Summary
In the MCS tree barrier, there are two separate data structures that must be maintained. The first data structure (a 4-ary tree, each node having at most four children) handles the arrival of the processes, and the second data structure handles the signaling and waking up of all the other processes. In a nutshell, each parent node holds pointers to its children’s structures, allowing the parent process to wake up its children once all children have arrived.
The tournament barrier constructs a tree too, and at each level two processes compete against one another. These competitions, however, are fixed: the algorithm predetermines which process will advance to the next round. The winners percolate up the tree, and at the top-most level the final winner signals and wakes up the loser. This waking up of losers repeats at each lower level until all nodes are woken up.
The dissemination protocol reminds me of a gossip protocol. With this algorithm, all nodes detect convergence (i.e. that all processes have arrived) once every process has received a message from every other process (this is the key takeaway); a process receives one (and only one) message per round. The total communication complexity is O(N log N): there are ceil(log2 N) rounds, and during each round N messages are sent, one from each node to its ordained neighbor.
The algorithms described thus far share a common requirement: they all require sense reversal.
MCS Tree Barrier (Binary Wakeup)
MCS Tree barrier with its “has child” vector
Summary
Okay, I think I understand what’s going on. There are two separate data structures that need to be maintained for the MCS tree barrier. The first data structure handles arrival (this is the 4-ary tree) and the second (a binary tree) handles the signaling and waking up of all the other processes. The reason the latter works so well is that, by design, we know the position of each node, and each parent contains pointers to its children, allowing it to easily signal the wake up.
Tournament Barrier
Tournament Barrier – fixed competitions. Winner holds the responsibility to wake up the losers
Summary
Construct a tree; at the lowest level are all the nodes (i.e. processors), and processors compete with one another in pairs, although each round is fixed, fixed in the sense that the winner is predetermined. The spin location is statically determined at every level.
Tournament Barrier (Continued)
Summary
Two important aspects: arrival moves up the tree with match fixing; then each winner is responsible for waking up the “losers”, traversing back down. Curious what sort of data structure is used. I can see an array or a tree…
Tournament Barrier (Continued)
Summary
Lots of similarity with sense reversing tree algorithm
Dissemination Barrier
Dissemination Barrier – gossip like protocol
Summary
Ordered communication: like a well-orchestrated gossip protocol. Each process sends a message to its ordained peer during that “round”. But I’m curious: why do we need multiple rounds?
Dissemination Barrier (continued)
Summary
The gossip in each round differs in the sense that the ordained neighbor changes: in round k, Pi sends to P(i + 2^k) mod N. I will probably need to read the paper to get a better understanding of the point of the rounds…
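To convince myself of the pattern, here is a throwaway snippet (mine, not from the paper) that prints which process talks to which in each round for a small n:

```c
/* Print the dissemination barrier communication pattern:
 * in round k, Pi sends to P((i + 2^k) mod n). */
#include <math.h>
#include <stdio.h>

int main(void) {
    int n = 5;                               /* number of processes */
    int rounds = (int)ceil(log2(n));         /* ceil(log2(5)) = 3 */
    for (int k = 0; k < rounds; k++) {
        printf("round %d:", k);
        for (int i = 0; i < n; i++)
            printf("  P%d->P%d", i, (i + (1 << k)) % n);
        printf("\n");
    }
    return 0;
}
```

(Compile with -lm.)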
Quiz: Barrier Completion
Summary
The key point here that I just figured out is this: every processor needs to hear from every other processor. So it’s log2(N) rounds with a ceiling, since N may not be a power of 2 (still not sure what that means exactly).
Dissemination Barrier (continued)
Summary
All barriers need sense reversal, and the dissemination barrier is no exception. This barrier technique works for NCC (non-cache-coherent) machines and clusters. Every round has N messages, and there are ceil(log2 N) rounds, so the total communication is O(N log N); N messages must be sent every round, no exception.
Performance Evaluation
Summary
Most important question to ask when choosing and evaluating performance is: what is the trend? Not exact numbers, but trends.
As mentioned previously, there are different types of synchronization primitives that we operating system designers offer. If, as an application designer, you need to ensure only one thread can access a piece of shared memory at a time, use a mutual exclusion synchronization primitive. But what about a different scenario, in which you need all threads to reach a certain point in the code, and only once all threads reach that point do they continue? That’s where barrier synchronization comes into play.
This post covers two types of barrier synchronization. The first is the naive, centralized barrier and the second is the tree barrier.
In a centralized barrier, we basically have a global count variable, and as each thread enters the barrier, it decrements the shared count. After decrementing the count, each thread hits a predicate and branches: if the count is not zero, the thread enters a busy spin loop, spinning while the count is greater than zero. However, if the counter equals zero after decrementing, that means all threads have arrived at the barrier.
Simple enough, right? Yes it is, but the devil is in the details because there’s a subtle bug, a subtle edge case. It is entirely possible (based on the code sketch below) that when the last thread enters the barrier and decrements the count, all the other threads suddenly move beyond the barrier (since the count is no longer greater than zero) and the last thread never gets a chance to reset the count back to N, the number of threads.
How to avoid this problem? Simple: add another while loop that guarantees that the threads do not leave the barrier until the counter gets reset. Very elegant. Very simple.
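Here is my reconstruction of the counting barrier in C11 atomics (a sketch, not the lecture’s exact code); the second spin loop is the fix just described.

```c
/* Centralized (counting) barrier sketch for N threads. */
#include <stdatomic.h>

#define N 4                          /* number of threads (assumed) */
static atomic_int count = N;

void centralized_barrier(void) {
    if (atomic_fetch_sub(&count, 1) == 1) {
        atomic_store(&count, N);     /* last arrival resets the barrier */
    } else {
        while (atomic_load(&count) > 0)
            ;                        /* spin until everyone has arrived */
        while (atomic_load(&count) != N)
            ;                        /* the fix: also wait for the reset */
    }
}
```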
The next type of barrier is a tree barrier. The tree barrier groups processes together at multiple levels (the number of levels is log n, where n is the number of processors), each group maintaining its own count and locksense variables. The benefit? Each group spins on its own locksense. The downside? The spin location is dynamic, not static, which can impede performance on NUMA architectures.
Centralized Barrier
Centralized Barrier
Summary
Centralized barrier synchronization is pretty simple: keep a counter that decrements as each thread reaches the barrier. Every thread/process will spin until the last thread arrives, at which point the last thread will reset the barrier counter so that it can be used later on
Problems with Algorithm
Summary
Race condition: while the last thread is resetting the counter, all the other threads have already moved forward past the barrier.
Counting Barrier
Summary
Such a simple and elegant solution: add a second spin loop (still inefficient, but neat nonetheless). Next up: the sense reversing barrier algorithm.
Sense Reversing Barrier
Sense Reversing Barrier
Summary
One way to optimize the centralized barrier is to introduce a sense reversing barrier. Essentially, each process maintains its own local “sense” that flips from 0 to 1 (or 1 to 0) each time a synchronization barrier is entered. This local variable is compared against a shared flag, and only when the two are equal can the threads/processes proceed past the current barrier and move on to the next.
Tree Barrier
Tree Barrier
Summary
Group the processes (or threads) so that each group has its own shared variables (count and locksense). Before flipping the locksense, the final process needs to move “up to the next level” and check whether all other processors have arrived at that level. Things are getting a little more spicy and complicated with this type of barrier.
Tree Barrier (Continued)
Summary
With a tree barrier, a process arrives at its group (with its count and locksense), decrements the count variable, and then checks the locksense variable. If the locksense has not flipped, the process spins. If it is the last process in its group to arrive, it moves up to the next level of the tree.
Tree Barrier (continued)
Summary
Once the last process reaches the root, it is its responsibility to begin waking up the lower levels, traversing back down the tree and flipping the locksense flag at each level.
Tree Barrier (Continued)
Summary
As always, there’s a trade-off or hidden downside with this implementation: the spin location is not statically determined. This dynamic spin location may be problematic, especially on NUMA (non-uniform memory access) architectures, because a process may end up spinning on a remote memory location. But my question is: are there any systems that do not offer cache coherence?
What’s the deal with a sense reversing barrier? Even after watching the lectures on the topic, I was still confused as to how a single flag toggling between two values (true and false) can communicate whether or not all processes (or threads) have reached the same point in the code. This concept completely baffled me.
Trying to make sense of it all, I did some “research” (aka google search) and landed on a great article titled Implementing Barriers1. The article contains source code written in C that helped demystify the topic.
Sense Reversal Barrier. Credit: (Nkindberg 2013)
The article explains how each process must maintain its own local sense variable and compare it against a shared flag. That was the key: the fact that each process maintains its own local variable, separate from the shared flag. Each time a process enters a barrier, its local sense toggles, switching from 1 to 0 (or from 0 to 1), depending on the variable’s last value. If the local sense variable and the shared flag differ, then the process (or thread) must wait before proceeding beyond the current barrier.
For example, let’s say process 0 initializes its local sense variable to 0. The process enters the barrier, flipping its local sense from 0 to 1. Once the process reaches the barrier, it compares the shared flag (also initialized to 0) with its local sense variable, and since the two are not equal, the process waits until all other processes arrive.
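Putting it all together, here is a sketch of a sense-reversing barrier pieced together from that description (C11 atomics; the exact structure is my own choice, not the article’s code verbatim). Each thread keeps its own local_sense, initialized to false to match the shared flag.

```c
/* Sense-reversing barrier sketch for N threads. */
#include <stdatomic.h>
#include <stdbool.h>

#define N 4                                   /* number of threads (assumed) */
static atomic_int count = N;
static atomic_bool shared_flag = false;       /* the single shared flag */

void sense_barrier(bool *local_sense) {
    *local_sense = !*local_sense;             /* flip my local sense on entry */
    if (atomic_fetch_sub(&count, 1) == 1) {
        atomic_store(&count, N);              /* last arrival resets the count */
        atomic_store(&shared_flag, *local_sense);  /* ...and releases everyone */
    } else {
        while (atomic_load(&shared_flag) != *local_sense)
            ;                                 /* spin until the flag matches my sense */
    }
}
```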
In part 1 of synchronization, I talked about the more naive spin locks and other approaches that offer only marginally better performance by adding delays, spinning on cached copies (i.e. avoiding bus traffic), and so on. Of all the locks discussed thus far, the array-based queuing lock offers low latency, low contention, and low waiting time. Best of all, this lock offers fairness. However, there’s no free lunch: we trade memory for performance, since each lock contains an array with N entries, where N is the number of physical CPUs. That can be wasteful on machines with many CPUs, especially when not all processors contend for the lock.
That’s where linked based queuing locks come in. Instead of allocating up front a contiguous block of memory reserved for each CPU competing for the lock, the linked based queuing lock dynamically allocates a node and maintains a linked list. That way, we only allocate the structure as requests trickle in. Of course, we need to be extremely careful with maintaining the linked list structure and ensure that we atomically update the list, using instructions such as fetch_and_store during the lock operation.
Most importantly, we (as the OS designers implementing this lock) need to carefully handle the edge cases, especially removing the latest node (i.e. setting the lock’s pointer to NIL) while another node is being added. To overcome this classic race condition, we rely on a new (to me at least) atomic operation: compare_and_swap. The instruction is called like this: compare_and_swap(L, me, nil). If the lock still points at my node (i.e. I am the latest requester), then set it to NIL. Otherwise, we know that there’s a new request in flight and we have to handle that edge case.
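Here is my best-effort sketch of the linked (MCS-style) queuing lock using C11 atomics, with atomic_exchange standing in for fetch_and_store and atomic_compare_exchange_strong for compare_and_swap. The got_it and next fields follow these notes; everything else is my own reconstruction, not the paper’s code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;    /* my successor in the queue, if any */
    atomic_bool got_it;              /* successor spins on this flag */
} qnode;

typedef struct {
    _Atomic(qnode *) tail;           /* last requester in line, or NULL */
} mcs_lock;

void mcs_acquire(mcs_lock *L, qnode *me) {
    atomic_store(&me->next, (qnode *)NULL);
    atomic_store(&me->got_it, false);
    /* fetch_and_store: atomically append myself and learn my predecessor */
    qnode *prev = atomic_exchange(&L->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);        /* link in behind the predecessor */
        while (!atomic_load(&me->got_it))
            ;                                 /* spin on my own flag */
    }
}

void mcs_release(mcs_lock *L, qnode *me) {
    if (atomic_load(&me->next) == NULL) {
        /* compare_and_swap: if I am still the tail, clear the lock and leave */
        qnode *expected = me;
        if (atomic_compare_exchange_strong(&L->tail, &expected, (qnode *)NULL))
            return;
        /* otherwise a new request is in flight; wait for it to link in */
        while (atomic_load(&me->next) == NULL)
            ;
    }
    qnode *succ = atomic_load(&me->next);
    atomic_store(&succ->got_it, true);        /* signal my successor */
}
```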
Anyways, check out the last paragraph on this page if you want a nice matrix that compares all the locks discussed so far.
Linked Based Queuing Lock
Summary
A similar approach to Anderson’s array-based queueing lock, but using a linked list (the authors are MCS: Mellor-Crummey and Scott). Basically, each requester has a qnode associated with it, the structure containing a got_it flag and a next pointer. When I obtain the lock, the queue points to my qnode and I can proceed into the critical section.
Link Based Queuing Lock (continued)
Summary
The lock operation (i.e. Lock(L, me)) requires two things to happen atomically; to achieve this, we use the fetch_and_store operation, an instruction that retrieves the previous tail and then stores my node in its place.
Linked Based Queuing Lock
Summary
When calling unlock, the code removes the current node from the list and signals the successor, achieving behavior similar to the array-based queue lock. But an edge case remains: a new request may be forming while the current thread is setting the lock pointer to NIL.
Linked Based Queuing Lock (continued)
Summary
Learned about a new primitive atomic operation, compare_and_swap, a useful instruction for handling the edge case of updating the dummy node’s next pointer when a new request is forming.
Linked Based Queuing Lock (continued)
Summary
Hell yea. Correctly anticipated the answer when professor asked: what will the thread spin on if compare_and_swap fails?
Linked Based Queuing Lock (continued)
Summary
Take a look at the papers to clarify the synchronization algorithms on a multi-processor. Lots of primitives are required, like fetch_and_store and compare_and_swap. If these hardware primitives are not available, we’ll have to implement them in software.
Linked Based Queuing Lock (continued)
Summary
Pros and cons of the linked based approach: the space complexity is bound by the number of dynamic requests, but the downside is the maintenance overhead of the linked list. Another potential downside is the lack of underlying primitives such as compare_and_swap on some architectures.
Algorithm grading Quiz
Comparison between various spin locks
Summary
If the amount of contention is low, use spin with delay plus exponential backoff. If the lock is highly contended, use a statically assigned scheme.
I broke down the synchronization topic into two parts and this will cover material up to and including the array based queuing lock. I’ll follow up with part 2 tomorrow, which will include the linked list based queuing lock.
There are a couple of different types of synchronization primitives: mutual exclusion and barrier synchronization. Mutual exclusion ensures that only one thread (or process) at a time can write to a shared data structure, whereas barrier synchronization ensures that all threads (or processes) reach a certain point in the code before continuing. The rest of this summary focuses on the different ways we as operating system designers can implement mutual exclusion and offer it as a primitive to system programmers.
To implement a mutual exclusion lock, three operations need to be bundled together: reading, checking, and writing. These three operations must happen atomically, all together and at once. To this end, we need to leverage the underlying hardware’s instructions, such as test_and_set, fetch_and_increment, or (generically) fetch_and_phi.
Whatever way we implement the lock, we must evaluate our implementation with three performance characteristics:
Latency (how long to acquire the lock)
Waiting time (how long a single thread must wait before acquiring the lock)
Contention (how long it takes to acquire the lock when multiple threads compete)
In this section, I’ll list each of the implementations, ordered from most basic to most advanced:
Naive Spin Lock – a simple while loop that calls the test_and_set atomic operation. Downside? Lots of bus traffic for cache invalidation or cache updating. What can we do instead? (See the sketch after this list.)
Caching Spin Lock – almost identical to the previous implementation, but instead of spinning on test_and_set in a while loop, we spin on a read, followed by a test_and_set once the lock looks free. This reduces the noisy bus traffic of test_and_set (which, again, bypasses the cache and writes to memory every time)
Spinlock with delay – adds a delay to each test_and_set attempt. This reduces spurious checks and ensures that not all threads perform test_and_set at the same time
Ticket Lock – each new request that arrives obtains a ticket. This is analogous to how a deli hands out tickets to its customers; it is fair, but releasing the lock still signals all the waiting threads to wake up
Array based queuing lock – an array that is N in size (where N is the number of processors). Each entry in the array can be in one of two states: has-lock and must-wait. Fair and efficient, but it trades off space: each lock gets its own array with N entries, where N is the number of processors. This can be expensive for machines with thousands of processors and potentially wasteful if not all of them contend for the lock
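For reference, here is a quick sketch of the first two locks in the list above (my own code, using GCC’s __sync builtins as a stand-in for test_and_set):

```c
#include <stdbool.h>

static volatile int lock_var = 0;            /* 0 = free, 1 = held */

/* 1. Naive spin lock: hammer test_and_set in a loop. */
void spin_lock_naive(void) {
    while (__sync_lock_test_and_set(&lock_var, 1))
        ;                                    /* every attempt hits the bus */
}

/* 2. Caching spin lock ("spin on read"): spin in the cache first and
 *    only attempt test_and_set once the lock looks free. */
void spin_lock_spin_on_read(void) {
    while (true) {
        while (lock_var == 1)
            ;                                /* read from my private cache */
        if (!__sync_lock_test_and_set(&lock_var, 1))
            return;                          /* won the race */
    }
}

void spin_unlock(void) {
    __sync_lock_release(&lock_var);          /* sets lock_var back to 0 */
}
```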
Lesson Summary
Summary
There are two types of locks: exclusive and shared. An exclusive lock guarantees that one and only one thread can access/write data at a location. A shared lock allows multiple readers while guaranteeing that no other thread is writing at that time.
Synchronization Primitives
Summary
Another type of synchronization primitive (besides mutual exclusion) is a barrier. A barrier basically ensures that all threads reach a certain execution point, and only then can the threads advance. It’s similar to real life: when you get to a restaurant, you cannot be seated until your whole party arrives. That’s an example of a barrier.
Programmer’s Intent Quiz
Summary
We can write code that enforces a barrier with a simple while loop (but it’s inefficient) using atomic reads and writes.
Programmer’s Intent Explanation
Summary
Programmer’s Intent Quiz
Because of the atomic reads and writes offered by the instruction set, we can easily use a flag to coordinate between processes. But are these instructions sufficient for creating a mutual exclusion lock primitive?
Atomic Operations
Summary
The instruction set ensures that each read and write operation is atomic. But if we need to implement a lock, there are three operations (read, check, write) that need to be bundled together into a single instruction by the hardware. This grouping can be done with instructions like test_and_set or, generically, fetch_and_increment or fetch_and_phi.
Scalability Issues with Synchronization
Summary
What are the three performance concerns when implementing a lock, and which two are under the control of the OS designer? The three scalability issues are: latency (the time to acquire the lock), waiting time (how long a thread waits until it acquires the lock), and contention (how long it takes, when all threads are competing for the lock, for the lock to be handed to a single thread). The first and third (latency and contention) are under the control of the OS designer. The second is not: waiting time is largely driven by the application itself, because the critical section might be very long.
Naive Spinlock (Spin on T+S)
Summary
Why is the spin lock considered naive, and how does it work? Although the professor did not explicitly call out why the spin lock is naive, I can guess why (and my guess will be confirmed in the next slide, during the quiz). Basically, each thread eats up CPU cycles by repeatedly performing test_and_set. Why not just signal the waiters instead? That would be more efficient. But, as usual, nothing is free, so there must be a trade-off in terms of complexity or cost. Returning to how the naive lock works: it’s basically a while loop around test_and_set (which semantically guarantees that only a single thread will receive the lock).
Problems with Naive Spin Lock
Problems with naive spinlock
Summary
The spin lock is naive for three reasons: too much contention (when the lock is released, all threads, maybe thousands, jump at the opportunity), it does not exploit the cache (test_and_set by nature must bypass the cache and go straight to memory), and it disrupts useful work (only one process can make forward progress anyway).
Caching Spinlock
Caching Spin lock (spin on read)
Summary
To take advantage of the cache, processors first spin on while(L == locked), which reads from the cache. Once the lock is released, all processors then proceed with test_and_set. Can we do better than this?
Spinlocks with Delay
Summary
Any sort of delay will improve performance of a spin lock when compared to the naive solution
Ticket Lock
Summary
A fair algorithm that uses a deli or restaurant analogy: people who come before will get the lock before people who come after. This is great for fairness, but performance is still lacking in the sense that when a thread releases the lock, an update is sent across the entire bus. I wonder: can this even be avoided, or is that just the reality of the situation?
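A minimal ticket lock sketch matching the deli analogy; fetch_and_increment is emulated with GCC’s __sync_fetch_and_add, and the struct and field names are mine:

```c
typedef struct {
    volatile unsigned next_ticket;           /* number handed to the next arrival */
    volatile unsigned now_serving;           /* ticket currently allowed in */
} ticket_lock;                               /* initialize both fields to 0 */

void ticket_acquire(ticket_lock *l) {
    /* fetch_and_increment: take a ticket */
    unsigned my_ticket = __sync_fetch_and_add(&l->next_ticket, 1);
    while (l->now_serving != my_ticket)
        ;                                    /* spin until my number is called */
}

void ticket_release(ticket_lock *l) {
    l->now_serving++;                        /* every waiter sees this update */
}
```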
Spinlock Summary
Array-based queuing lock
Summary
A couple of options so far: 1) read then test_and_set (no fairness), 2) test_and_set with delay (no fairness), 3) ticket lock (fair but noisy). How can we do better? How can we signal one and only one thread to wake up and attempt to grab the lock? Apparently, I’ll learn this shortly, with queueing locks. This lesson reminds me of a conversation I had with my friend Martin about avoiding a busy while() loop for lock contention, a conversation we had maybe a year or so ago (maybe even two years ago).
Array Based Queuing Lock
Summary
Create an array that is N in size (where N is the number of processors). Each entry in the array can be in one of two states: has-lock and must-wait. Only one entry can ever be in the has-lock state. I’m proud that I can understand the circular queue (just by looking at the array) with my understanding of the mod operator. But even so, I’m having a hard time understanding how we’ll use this data structure in real life.
Array Based Queuing Lock (continued)
Summary
Lock acquisition looks like: fetch_and_increment(queue_last), followed by while(flags[my_place mod N] == must_wait). Still not sure how this avoids generating the cache/bus traffic described earlier…
Array-based queue lock
Summary
Array-based queuing lock (mod)
Using mod, we basically update the “next” entry in the array and set it to has_lock.
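Pulling these pieces together, here is a sketch of the array-based queuing lock assembled from these notes; the field names and the GCC builtin are my own choices, not the lecture’s code.

```c
#define NUM_PROCS 8                          /* N = number of processors (assumed) */

enum slot_state { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    volatile int flags[NUM_PROCS];           /* one slot per processor */
    volatile unsigned queue_last;            /* next slot to hand out */
} array_lock;

void array_lock_init(array_lock *l) {
    for (int i = 0; i < NUM_PROCS; i++)
        l->flags[i] = MUST_WAIT;
    l->flags[0] = HAS_LOCK;                  /* first requester goes right in */
    l->queue_last = 0;
}

unsigned array_lock_acquire(array_lock *l) {
    /* fetch_and_increment grabs my place in line */
    unsigned my_place = __sync_fetch_and_add(&l->queue_last, 1) % NUM_PROCS;
    while (l->flags[my_place] == MUST_WAIT)
        ;                                    /* spin on my own slot */
    return my_place;                         /* caller passes this to release */
}

void array_lock_release(array_lock *l, unsigned my_place) {
    l->flags[my_place] = MUST_WAIT;          /* reset my slot for reuse */
    l->flags[(my_place + 1) % NUM_PROCS] = HAS_LOCK;   /* wake exactly one waiter */
}
```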
You need to take away the following two themes for the shared memory machine model:
Difference and relationship between cache coherence (dealt with in hardware) and cache consistency (handled by the operating system)
The different memory machine models (e.g. dance hall, symmetric multiprocessing, and distributed shared memory architecture)
Cache coherence is the promise delivered by the underlying hardware architecture. The hardware employs one of two techniques: write invalidate or write update. In the former, when a memory address gets updated by one of the cores, the system sends a message on the bus to invalidate the cache entry stored in all the other private caches. In the latter, the system instead updates all the private caches with the correct data. Either way, the mechanism by which the caches are kept coherent is an implementation detail that the operating system is not privy to.
Although the OS has no insight into how the hardware delivers cache coherence, the OS does rely on cache coherence to build cache consistency, the hardware and software working in harmony.
Shared Memory Machine Model
Summary
Shared memory model – dance hall architecture
Symmetric multiprocessing – each processor’s time to access memory is the same
Distributed shared memory – each CPU has its own private cache and its own memory, although they can access each other’s addresses
There are three memory architectures: dance hall (the CPUs sit on one side of the interconnect and the memories on the other), SMP (from the perspective of each CPU, access time to memory is the same), and distributed shared memory (some memory is faster for a given CPU to access because it is local). The lecture doesn’t go much deeper than this, but I’m super curious about the distributed architecture.
Shared Memory and Caches
Summary
Because each CPU has its own private cache, we may run into a problem known as cache coherence. This situation can occur if the caches tied to each CPU contain different values for the same memory address. To resolve this, we need a memory consistency model.
Quiz: Processes
Summary
The quiz tests us by presenting two processes that use shared memory, each updating different variables, and then asks what the possible values of the variables are. Apart from the last choice, all are possible; the last one would break the intuition of a memory consistency model (discussed next).
Memory consistency model
Summary
Introduces the sequential consistency memory model, a model that exhibits two characteristics: program order and arbitrary interleaving. This is analogous to someone shuffling a deck of cards.
Sequential Consistency and cache coherence (Quiz)
Summary
Cache Coherence Quiz – Sequential Consistency
Which of the following are possible values for the given instructions?
Hardware Cache Coherence
Summary
Hardware Cache Coherence
There are two strategies for maintaining cache coherence: write invalidate and write update. In the former, the system bus broadcasts an invalidate message when one of the caches modifies an address in its private cache. In the latter, the system bus sends an update message to each of the caches, and each cache updates its private copy with the correct data. Obviously, the lecture oversimplifies the intricacies of maintaining a coherent cache (if you want to learn more, check out the high performance computing architecture lectures, or maybe future modules in this course will cover this in more detail).
Scalability
Summary
Scalability – expectations with more processors. Pros: exploit parallelism. Cons: increased overhead
Adding processors increases parallelism and can improve performance. However, performance also suffers from the additional overhead of maintaining the bus (another example of trade-offs and how nothing is free).